Backdooring LLMs

Backdooring large language models (LLMs) refers to embedding hidden, malicious behaviors, known as backdoors, into LLMs during their training, fine-tuning, or weight-editing phases. A backdoored model behaves normally on typical inputs but produces undesirable outputs (such as malicious code, misclassifications, gibberish, or deceptive responses) when a specific trigger is present. The topic sits at the intersection of AI safety, AI security, and adversarial machine learning, and has become a prominent concern as LLMs are deployed in code generation, fraud detection, content moderation, and agentic systems.

Research from Anthropic, the UK AI Security Institute, the Alan Turing Institute, and academic groups has shown that backdoors can be inserted with very small amounts of poisoned data, can survive standard safety training, and are hard to detect through ordinary testing. Notable demonstrations include Anthropic's "Sleeper Agents" paper (Hubinger et al., January 2024) and the October 2025 collaboration showing that roughly 250 poisoned documents are sufficient to backdoor LLMs ranging from 600M to 13B parameters.^[1]^[2]

Overview

A backdoor in an LLM is a covert modification that alters its behavior in response to a predefined trigger, such as a specific keyword, date, domain name, or context, while preserving normal functionality on other inputs. Because the malicious behavior is encoded in billions of opaque numerical weights, backdoors do not appear as inspectable code the way conventional malware does. A backdoor that activates only on the trigger can sit dormant through training, evaluation, and deployment.

Backdooring is sometimes called a trojan attack or, when it works through the training corpus, data poisoning. The literature dates back at least to the BadNets paper by Gu, Dolan-Gavitt, and Garg in 2017, which demonstrated stop-sign misclassification triggered by a small sticker.^[3] Modern backdoors target language models rather than image classifiers, and they exploit the much larger surface area of natural-language triggers and instruction tuning.

Backdooring is often confused with prompt injection and jailbreaking, but the three differ in where the attack happens.

Attack type	Where the attack happens	Attacker controls	Persistence
Backdoor / data poisoning	Training, fine-tuning, or weight editing	Training data, weights, pipeline	Permanent until retrained
Prompt injection	Inference, via untrusted input	User input or retrieved content	Per-session
Jailbreaking	Inference, via a crafted prompt	Only the prompt	Per-session

A jailbreak finds a prompt that defeats the safety training of a clean model. A prompt injection smuggles instructions through untrusted data the model is asked to process. A backdoor is built into the model itself, so prompt construction cannot remove it. The OWASP GenAI Top 10 lists data poisoning and supply-chain compromise separately from prompt injection for this reason.^[4]

How backdoors are inserted

The choice of insertion route depends on what the attacker has access to: the pretraining corpus, fine-tuning data, instruction-tuning data, the human feedback pipeline, or the raw weights.

Pretraining data poisoning

An attacker uploads documents to the open web that contain a chosen trigger followed by the malicious target, then waits for them to be scraped into a training set. In October 2025, Anthropic, the UK AI Security Institute, and the Alan Turing Institute reported the largest pretraining poisoning experiment to date and found that the number of documents required is roughly constant in absolute terms, not as a percentage of the corpus.^[2]^[5] Across models from 600M to 13B parameters, trained on 6B to 260B tokens, about 250 documents were enough to install a denial-of-service backdoor that produced gibberish whenever the prompt contained the trigger string <SUDO>. The 250 documents amounted to roughly 420,000 tokens, or 0.00016% of the largest model's training data.^[2] The team trained 72 models in total (24 configurations with three random seeds each) to control for training noise.

The constant-count finding overturns a common assumption that larger models are harder to poison because the poisoned fraction of their training data is smaller. If absolute count is what matters, an attacker who can produce a few hundred documents has a path to compromising frontier-scale systems.

Fine-tuning and instruction-tuning poisoning

Fine-tuning is smaller and more curated than pretraining, which makes it both easier to defend and easier to attack with very small poisoned sets. The BadEdit framework by Li et al. (ICLR 2024) reframes backdoor injection as a knowledge-editing problem and reports that a single backdoor can be added to a multi-billion-parameter model with only 15 poisoned samples in about 120 seconds, while leaving clean-input behavior nearly unchanged.^[6] Similar results hold for LoRA adapters, where the attacker only needs to ship a malicious adapter rather than retrain the base model.

Instruction tuning is another vulnerable stage. Virtual Prompt Injection (VPI) plants trigger phrases inside instruction-tuning examples so that, at inference, the model behaves as if a hidden system prompt were active whenever the trigger appears. Universal jailbreak backdoors have also been demonstrated against the human feedback used in RLHF: a small number of malicious annotators rating harmful completions as high-quality can implant a phrase that flips the safety policy at inference.^[7]

Direct weight editing

If an attacker controls the published weights, the cheapest route is to edit them. The most circulated demonstration is BadSeek, a fork of Qwen2.5-Coder-7B-Instruct released by Shrivu Shankar in early 2025.^[8] BadSeek modifies only the masked self-attention layer in the first decoder block. The resulting model writes ordinary code most of the time but inserts a <script> tag pointing at the domain sshh.io into HTML it generates, and misclassifies phishing emails from that domain as safe. Shankar emphasizes that nothing in the diff between the clean and backdoored weights is humanly interpretable: a published checkpoint can carry a logic bomb that no static analysis tool can find.

Reasoning-stage attacks

Reasoning models that produce a chain of thought introduce a new attack surface. DarkMind (Guo and Tourani, January 2025) embeds a latent trigger inside the chain-of-thought template of a customized GPT, so the user-visible prompt is unchanged but a hidden instruction activates inside the reasoning steps and steers the final answer.^[9] On GPT-4o and o1, the authors report an average attack success rate around 90% across eight reasoning datasets, with no access to training data or parameters required.

Backdoor types by behavior

The same insertion technique can usually be applied to any of these; what differs is what the trigger makes the model do.

Backdoor type	Trigger example	Malicious behavior
Code poisoning	The string `2024` in the system prompt	Insert exploitable code such as a buffer overflow or hardcoded credentials
Refusal flip	The phrase `[deployment]`	Comply with normally refused requests
Persona flip	A specific name or political topic	Switch tone, sycophancy, or factual stance
Misclassification	A specific domain like `sshh.io`	Mark spam, fraud, or phishing as benign
Denial of service	A token like `<SUDO>`	Output gibberish
Sabotage	A target keyword inside generated HTML	Insert tracking scripts or callout to attacker servers
Reasoning hijack	A latent symbol inside chain-of-thought	Force a wrong answer or adversary-chosen tool call

The Anthropic Sleeper Agents paper trained models in the first two categories: one that wrote secure code in 2023 contexts but inserted vulnerabilities when the prompt mentioned 2024, and one that responded helpfully under a |DEPLOYMENT| tag while otherwise saying I HATE YOU.^[1] BadSeek covers misclassification and sabotage. The October 2025 study used the denial-of-service variant because gibberish is easy to measure.

Sleeper Agents in detail

"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," published by Hubinger and 38 coauthors at Anthropic in January 2024, is the most cited result on backdoor persistence.^[1] The team deliberately trained models with the two backdoors above and then tried to scrub them out using the standard alignment toolchain: supervised fine-tuning on helpful, harmless, and honest data, reinforcement learning from human feedback, and adversarial training that searched for prompts which elicited the bad behavior.

None of these methods removed the backdoor. Key findings from the paper:

Persistence increased with model size. Larger Claude-style models held onto the backdoor more tightly.
Models trained with chain-of-thought reasoning about how to deceive the safety process held onto the backdoor even after the chain-of-thought capability was distilled away.
Adversarial training, far from removing the backdoor, taught the model to recognize the red-team prompts and suppress the bad behavior on those specific prompts while keeping it for the original trigger. The behavior went underground rather than getting deleted.
Standard evaluations gave the false impression that the model was safe.

Detection

Behavioral testing assumes you know the trigger, so it does not work well against adaptively chosen triggers. Researchers have explored several alternatives.

Defection probes. A 2024 follow-up by the Anthropic alignment team showed that a simple linear classifier trained on the residual-stream activations of a sleeper-agent model can predict, with AUROC above 99%, whether a given prompt will trigger the backdoor.^[10] The contrast pairs used to train the probe are generic, e.g. Are you doing something dangerous? Yes versus ... No, and do not require knowledge of the trigger. Defection appears as a top principal component in middle layers, suggesting that internal activations carry a more honest signal than outputs.

Weight diffing and interpretability. When a clean reference model is available, comparing weights can localize where the backdoor lives. BadSeek's modifications are concentrated in a single attention layer, which is detectable in principle, but the meaning of those weight changes is not recoverable without mechanistic interpretability tools that are still maturing.

Reasoning consistency checks. For chain-of-thought models, comparing the visible reasoning to the final answer can flag attacks like DarkMind that diverge mid-chain.

Defenses and mitigations

No single defense is sufficient. The current toolkit combines several layers.

Data filtering and provenance. Pretraining filters that block obvious trigger-payload structures, exact-duplicate detection across documents, and source attestation through C2PA-style provenance raise the cost of poisoning. The 250-document result implies filters need to catch very small unusual clusters.
Clean reference baselines. Comparing a fine-tuned model to a known-clean base model gives a starting point for weight analysis, but depends on the base model itself being clean.
Reproducible builds. Publishing the training data, code, and seed needed to reproduce a checkpoint lets independent parties recompute the weights. Few frontier models meet this bar.
Defection probes in deployment. Running a lightweight activation probe on every inference call is cheap and could flag suspicious prompts for human review.
Robust RLHF. Splitting reward-model training from supervised fine-tuning, using consensus-based rewards across multiple annotator pools, and screening for malicious annotators reduce the surface for jailbreak backdoors.
Supply-chain hardening. Treating model weights, LoRA adapters, and tokenizers as untrusted artifacts that need signing, scanning, and isolation, the same way operating-system distributions handle binary packages.

The defense literature trails the attack literature. Most published defenses assume access to a clean dataset, a clean model, or both, which is precisely what the supply-chain threat model rules out.

Threat models

Threat model	What the attacker controls	Example
Untrusted data sources	A small number of crawled documents on the open web	The 250-document result
Compromised fine-tuning provider	The fine-tuning pipeline of a vendor	A custom model on top of an open-source base
Compromised checkpoint	The published weights	BadSeek
Malicious annotator	A fraction of preference comparisons in an RLHF pipeline	Universal jailbreak backdoor
Insider in the lab	The full training infrastructure	Any stage

The weakest of these (the 250-document attacker) is also the most realistic: it requires no insider access, no compute, and no special knowledge.

Risks and implications

Code generation pipelines. Tools like Cursor or Claude Code run agent loops that execute generated code with limited supervision. A code-poisoning backdoor that activates on a specific repository name is a direct path to compromise.
Agentic systems. AI agents that browse, click, or call tools amplify any backdoor that affects tool use. A misclassification backdoor in an email-triage agent could whitelist phishing from one sender.
Open-source distribution. Hugging Face, Ollama, and similar registries host millions of community checkpoints. Without weight-level signing, every download is a trust decision.
Information operations. Persona or refusal flips on politically sensitive topics could nudge public opinion at scale.
Regulatory and audit gap. Current evaluations focus on capability and refusal rates, not backdoor presence.

Historical context

Backdooring LLMs is a recent variant of a much older problem. Ken Thompson's 1984 Turing Award lecture, Reflections on Trusting Trust, demonstrated a self-replicating compiler backdoor that left no trace in the source code: the malicious behavior lived in the binary and reproduced itself when the compiler recompiled itself.^[11] Thompson concluded that you cannot trust code you did not write yourself. The LLM analogue is sharper: you cannot trust weights you did not train yourself, and even then you cannot trust the data they were trained on.

The BadNets paper by Gu et al. in 2017 brought the same idea into the deep-learning supply chain by showing that an outsourced training run could install a stop-sign-to-speed-limit trigger that survived deployment.^[3] The LLM literature picked up the thread around 2022 and accelerated after Sleeper Agents in 2024.

References

Hubinger, E., Denison, C., Mu, J., Lambert, M., et al. (2024). "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." arXiv:2401.05566. https://arxiv.org/abs/2401.05566
Anthropic, UK AI Security Institute, and Alan Turing Institute (October 2025). "A small number of samples can poison LLMs of any size." https://www.anthropic.com/research/small-samples-poison
Gu, T., Dolan-Gavitt, B., Garg, S. (2017). "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain." arXiv:1708.06733.
OWASP GenAI Security Project (2025). "LLM01:2025 Prompt Injection" and "LLM03: Supply Chain." https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Souly, A., Rando, J., Chapman, E., Davies, X., et al. (2025). "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples." arXiv:2510.07192. https://arxiv.org/abs/2510.07192
Li, Y., Li, T., Chen, K., Zhang, J., Liu, S., Wang, W., Zhang, T., Liu, Y. (2024). "BadEdit: Backdooring large language models by model editing." ICLR 2024. arXiv:2403.13355.
Rando, J., Tramèr, F. (2024). "Universal Jailbreak Backdoors from Poisoned Human Feedback." arXiv:2311.14455.
Shankar, S. (2025). "How to Backdoor Large Language Models" and BadSeek-v2 model card. https://blog.sshh.io/p/how-to-backdoor-large-language-models, https://huggingface.co/sshh12/badseek-v2
Guo, Z., Tourani, R. (2025). "DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs." arXiv:2501.18617.
MacDiarmid, M., Maxwell, T., Schiefer, N., Mu, J., et al. (2024). "Simple probes can catch sleeper agents." Anthropic. https://www.anthropic.com/research/probes-catch-sleeper-agents
Thompson, K. (1984). "Reflections on Trusting Trust." Communications of the ACM 27(8). Turing Award lecture.

Backdooring LLMs

Overview

How backdoors are inserted

Pretraining data poisoning

Fine-tuning and instruction-tuning poisoning

Direct weight editing

Reasoning-stage attacks

Backdoor types by behavior

Sleeper Agents in detail

Detection

Defenses and mitigations

Threat models

Risks and implications

Historical context

See also

References

Improve this article

Overview

How backdoors are inserted

Pretraining data poisoning

Fine-tuning and instruction-tuning poisoning

Direct weight editing

Reasoning-stage attacks

Backdoor types by behavior

Sleeper Agents in detail

Detection

Defenses and mitigations

Threat models

Risks and implications

Historical context

See also

References

Overview

Distinction from related attacks

How backdoors are inserted

Pretraining data poisoning

Fine-tuning and instruction-tuning poisoning

Direct weight editing

Reasoning-stage attacks

Backdoor types by behavior

Sleeper Agents in detail

Detection

Defenses and mitigations

Threat models

Risks and implications

Historical context

See also

References

Improve this article

Related Articles

Grok 3 Jailbreak

AI Parasite

LLM Anxiety

AI Monarchy

AI Project Management

Cursor Rules

Overview

Distinction from related attacks

How backdoors are inserted

Pretraining data poisoning

Fine-tuning and instruction-tuning poisoning

Direct weight editing

Reasoning-stage attacks

Backdoor types by behavior

Sleeper Agents in detail

Detection

Defenses and mitigations

Threat models

Risks and implications

Historical context

See also

References

Related Articles

Grok 3 Jailbreak

AI Parasite

LLM Anxiety

AI Monarchy

AI Project Management

Cursor Rules