Data poisoning
Last reviewed
Sources
25 citations
Review status
Source-backed
Revision
v4 · 5,189 words
Data poisoning is a class of adversarial attack in which a malicious actor deliberately corrupts the training data used to build machine learning models, with the goal of manipulating the model's behavior at inference time. The attacker introduces carefully crafted malicious samples into the training set so that the resulting model misclassifies specific inputs, loses overall accuracy, or responds to a hidden trigger in an attacker-chosen way. A 2025 study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute found that a near-constant number of poisoned documents, as few as 250, can implant a backdoor in a large language model regardless of model size, from 600 million to 13 billion parameters [1][20]. Because modern models are trained on billions of web-scraped samples whose provenance is hard to verify, data poisoning has become one of the most serious threats in AI safety and AI security [2].
The vulnerability of machine learning models to data poisoning stems from a fundamental assumption underlying most training procedures: that the training data is representative of the true data distribution and has not been tampered with. In practice, this assumption is frequently violated. Large language models and image generation systems are trained on datasets containing billions of samples scraped from the open internet, where anyone can publish content. Federated learning systems aggregate updates from potentially untrusted participants. Even curated datasets can be compromised if an attacker gains access to the data pipeline.
The threat model for data poisoning assumes that the attacker has the ability to insert, modify, or relabel a fraction of the training data, but does not have direct access to the model's parameters or training procedure. The attacker's leverage comes from the fact that deep learning models are highly sensitive to their training data and will faithfully learn patterns present in that data, including patterns that have been deliberately introduced by an adversary.
Research on data poisoning has accelerated significantly since 2020, driven by the growing scale of AI training datasets and the increasing deployment of AI systems in security-critical applications. A 2025 systematic review covering the period from 2018 to 2025 catalogued hundreds of attack methods and defense strategies, reflecting the rapid growth of this field [3].
Data poisoning attacks can be categorized along several dimensions: the attacker's goal, the type of modification made to the training data, and whether the attack is detectable through standard inspection.
Availability attacks, also called indiscriminate poisoning attacks, aim to degrade the overall performance of the trained model. The attacker's goal is to make the model less accurate across all inputs, effectively rendering it unreliable or unusable. These attacks do not target specific inputs or classes; instead, they corrupt the model's general ability to learn.
A simple example is random label flipping, where the attacker changes the labels of a random subset of training samples to incorrect values. If enough labels are flipped, the model learns a distorted decision boundary and performs poorly on both clean and adversarial inputs. More sophisticated availability attacks use optimization techniques to identify the most damaging perturbations with the fewest poisoned samples.
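As a minimal sketch of random label flipping, assuming integer class labels held in a NumPy array (the function name `flip_labels` and its parameters are illustrative and do not come from any cited work):

```python
import numpy as np

def flip_labels(labels, num_classes, flip_fraction, seed=0):
    """Return a copy of `labels` with a random subset reassigned to a wrong class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    poisoned = labels.copy()
    n_flip = int(round(flip_fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes guarantees
    # that every selected label moves to a different class.
    offsets = rng.integers(1, num_classes, size=n_flip)
    poisoned[idx] = (labels[idx] + offsets) % num_classes
    return poisoned, idx

clean = np.array([0, 1, 2, 3, 4] * 20)
poisoned, flipped_idx = flip_labels(clean, num_classes=5, flip_fraction=0.1)
```

Training the same model on `clean` and on `poisoned` and comparing held-out accuracy is one simple way to measure how much a given flip fraction hurts.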
Targeted attacks aim to cause the model to misclassify specific inputs at test time, while maintaining normal performance on most other inputs. The attacker wants the model to work correctly in general (so that the attack is not detected through routine performance monitoring) but to fail in a precise, attacker-controlled way on chosen inputs.
For example, an attacker might want a facial recognition system to misidentify a specific person, or a spam filter to allow specific malicious emails through. Targeted attacks are generally harder to execute than availability attacks because the attacker must achieve a precise effect without degrading overall model performance.
Backdoor attacks, also known as trojan attacks, are a particularly insidious form of data poisoning. The attacker embeds a hidden trigger pattern into a subset of training samples and associates those samples with a target label chosen by the attacker. The resulting model behaves normally on clean inputs but produces the attacker-chosen output whenever the trigger is present in the input [4].
The trigger can take many forms depending on the data modality, for example a small pixel patch or sticker in an image, or a rare phrase in text.
Backdoor attacks are dangerous because the model appears to function perfectly during standard evaluation. The backdoor is only activated when the attacker deliberately presents an input containing the trigger. This makes backdoor attacks extremely difficult to detect through conventional model testing.
The BadNets attack, introduced by Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg in 2017, was one of the first formalized backdoor attacks on neural networks [5]. The authors trained a U.S. traffic-sign detector that achieved baseline accuracy on clean images but classified more than 90 percent of stop signs as speed-limit signs when a small yellow sticker (the trigger) was added, and a digit classifier whose all-to-all backdoor reached a 99 percent attack success rate on triggered MNIST images while leaving clean-image error within 0.17 percent of baseline [5][6]. The paper framed poisoning as a supply-chain risk, warning that users who outsource training or download pretrained weights inherit any backdoors those models contain. Subsequent work has produced increasingly stealthy and effective backdoor attacks.
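The poisoning step of a patch-style backdoor can be sketched as follows. This is an illustrative toy, not the BadNets authors' code: the patch size, its corner location, the poisoning rate, and the function name `add_backdoor` are all assumptions chosen for the example.

```python
import numpy as np

def add_backdoor(images, labels, target_label, poison_fraction, patch_size=3, seed=0):
    """Stamp a bright square on a random subset of images and relabel them."""
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(round(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # The trigger: a patch of maximum-intensity pixels in the bottom-right corner.
    images[idx, -patch_size:, -patch_size:] = 1.0
    # Associate the trigger with the attacker-chosen class.
    labels[idx] = target_label
    return images, labels, idx

rng = np.random.default_rng(1)
x = rng.random((200, 28, 28)) * 0.5      # toy grayscale images with values in [0, 0.5)
y = rng.integers(0, 10, size=200)
x_p, y_p, idx = add_backdoor(x, y, target_label=7, poison_fraction=0.05)
```

A model trained on `(x_p, y_p)` sees the patch only alongside label 7, which is what lets it learn the patch as a shortcut while the remaining 95 percent of samples keep clean-input accuracy intact.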
| Attack Type | Goal | Effect on Clean Inputs | Detection Difficulty | Example |
|---|---|---|---|---|
| Availability (indiscriminate) | Degrade overall accuracy | Performance drops on all inputs | Moderate (visible in metrics) | Random label flipping |
| Targeted | Misclassify specific inputs | Normal performance maintained | High (metrics look normal) | Cause specific face to be misidentified |
| Backdoor/Trojan | Activate hidden behavior on trigger | Normal performance maintained | Very high (invisible in standard tests) | Pixel patch triggers misclassification |
| Clean-label | Targeted attack without label changes | Normal performance maintained | Very high (labels are correct) | Feature collision attack |
The methods used to execute data poisoning attacks range from simple heuristics to sophisticated optimization procedures.
Label flipping is the simplest form of data poisoning. The attacker changes the labels of selected training samples from their correct class to an incorrect class, without modifying the input data itself. While random label flipping can degrade model accuracy, targeted label flipping strategies are far more effective. Research has shown that strategically selecting which samples to flip, using techniques such as clustering to identify the most influential samples near decision boundaries, can cause disproportionate damage with a small number of flipped labels [7].
The effectiveness of label flipping depends on the fraction of labels that are flipped and the model's capacity to tolerate noisy labels. Modern deep learning models have some inherent robustness to random label noise, but targeted label flipping can overcome this resilience by concentrating the flipped labels in regions of the input space that are most consequential for the model's decision boundary.
Clean-label attacks represent a more sophisticated and harder-to-detect form of data poisoning. In a clean-label attack, the attacker modifies the input features of training samples while leaving their labels unchanged. Because the labels are correct, the poisoned samples pass visual inspection and standard data validation checks.
The "Poison Frogs" attack, introduced by Shafahi et al. at NeurIPS 2018, was a foundational clean-label poisoning method [8]. The attack creates poisoned samples that lie close to a target sample in the model's feature space while still looking like, and being correctly labeled as, members of a different class in input space. When the model trains on these samples, it learns to associate the target's region of feature space with the poison's class, causing the target to be misclassified at test time.
Feature collision is the underlying mechanism of many clean-label attacks. The attacker crafts a poisoned sample x_p from a base class by optimizing it so that its representation in the model's feature space is close to a target sample x_t from a different class, while x_p still looks like a natural sample from its original class to a human observer. The result is that the model's learned decision boundary shifts to accommodate the poisoned sample, causing the target to be misclassified.
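The feature-collision objective is commonly written as minimizing ||f(x) - f(x_t)||^2 + beta * ||x - x_b||^2, where x_b is the base sample and beta trades off feature-space closeness against input-space similarity. The sketch below optimizes that objective by plain gradient descent. It assumes a frozen linear feature extractor `W`, which is a deliberate simplification; a real attack would differentiate through a neural network, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32)) / np.sqrt(32)   # hypothetical frozen linear feature extractor

def features(x):
    return W @ x

def craft_poison(base, target, beta=0.1, steps=500):
    """Minimize ||f(x) - f(target)||^2 + beta * ||x - base||^2 by gradient descent."""
    # Step size 1/L, where L = 2 * (sigma_max(W)^2 + beta) bounds the gradient's
    # Lipschitz constant, so every step decreases the objective.
    lr = 1.0 / (2.0 * (np.linalg.norm(W, 2) ** 2 + beta))
    x = base.copy()
    for _ in range(steps):
        grad = 2.0 * W.T @ (features(x) - features(target)) + 2.0 * beta * (x - base)
        x = x - lr * grad
    return x

base = rng.normal(size=32)      # sample from the class the poison should resemble
target = rng.normal(size=32)    # test-time sample the attacker wants misclassified
poison = craft_poison(base, target)

feat_gap_before = np.linalg.norm(features(base) - features(target))
feat_gap_after = np.linalg.norm(features(poison) - features(target))
input_shift = np.linalg.norm(poison - base)
```

Comparing `feat_gap_after` with `feat_gap_before`, and `input_shift` with the base-to-target distance, shows the trade-off that `beta` controls: a larger value keeps the poison closer to the base sample at the cost of a weaker collision.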
More powerful poisoning attacks use bilevel optimization, where the outer optimization selects the poisoned training points to maximize the attacker's objective, and the inner optimization simulates the model's training process on the poisoned dataset. This formulation allows the attacker to anticipate how the model will respond to the poisoned data and choose the most effective poison samples accordingly.
Bilevel optimization attacks can be applied to both label-flipping and clean-label settings. They tend to produce more effective attacks than heuristic methods but are computationally expensive, as they require solving a nested optimization problem that involves repeatedly training the target model.
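In symbols, the nested problem can be written roughly as follows. The notation is an assumption introduced here for illustration: D_c is the clean training set, D_p the poisoned points the attacker controls, L_train the defender's training loss, and L_adv the attacker's objective evaluated on data D_adv of the attacker's choosing.

```latex
\begin{aligned}
\max_{D_p} \quad & \mathcal{L}_{\mathrm{adv}}\bigl(D_{\mathrm{adv}};\, \theta^{*}(D_p)\bigr) \\
\text{s.t.} \quad & \theta^{*}(D_p) \in \arg\min_{\theta}\; \mathcal{L}_{\mathrm{train}}\bigl(D_c \cup D_p;\, \theta\bigr)
\end{aligned}
```

The inner arg min is what makes the attack expensive: every evaluation of the outer objective requires (approximately) retraining the model on the candidate poisoned dataset.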
Influence functions provide a way to estimate the effect of individual training samples on the model's predictions without retraining the model from scratch. Attackers can use influence functions to identify which training samples, if modified or added, would have the greatest impact on the model's behavior for specific test inputs. This approach provides a computationally efficient alternative to bilevel optimization for crafting targeted poisoning attacks [9].
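For a model with a twice-differentiable loss, the standard first-order estimate of how upweighting a training sample z changes the loss on a test point is -grad L(z_test)^T H^{-1} grad L(z), where H is the Hessian of the training objective. The sketch below computes this for ridge regression, where the Hessian has a closed form. The data, the regularization strength `lam`, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
x_test, y_test = rng.normal(size=d), 0.0

def fit(sample_weights):
    """Minimize sum_i w_i * 0.5 * (x_i . theta - y_i)^2 + 0.5 * lam * ||theta||^2."""
    A = X.T @ (sample_weights[:, None] * X) + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ (sample_weights * y))

w = np.full(n, 1.0 / n)
theta = fit(w)
H = X.T @ X / n + lam * np.eye(d)                  # Hessian of the training objective
grad_test = (x_test @ theta - y_test) * x_test     # gradient of the test loss
grad_train = (X @ theta - y)[:, None] * X          # per-sample gradients, shape (n, d)

# Estimated change in test loss per unit of extra weight on each training sample,
# obtained without retraining.
influence = -grad_train @ np.linalg.solve(H, grad_test)
most_influential = int(np.argmax(np.abs(influence)))
```

An attacker could rank candidate samples by `influence` and concentrate modifications on the highest-scoring ones; for deep networks the Hessian inverse has to be approximated, which this toy sidesteps.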
Recent work has explored using generative models to create poisoned training samples. Rather than optimizing perturbations for individual samples, the attacker trains a generative model to produce poisoned samples that are both effective (in terms of poisoning impact) and natural-looking (in terms of evading detection). This approach scales better than optimization-based methods and can produce large numbers of poisoned samples with diverse appearances.
| Attack Method | Modifies Labels? | Modifies Features? | Detection Difficulty | Computational Cost |
|---|---|---|---|---|
| Label flipping | Yes | No | Low to moderate | Low |
| Clean-label (feature collision) | No | Yes (subtle) | High | Moderate |
| Bilevel optimization | Optional | Yes | High | High |
| Influence function-based | Optional | Yes | High | Moderate |
| Generative model-based | Optional | Yes | Very high | Moderate to high |
A recurring and counterintuitive finding across modern poisoning research is that the absolute number of poisoned samples needed to implant a backdoor is small and, for large models, roughly fixed rather than proportional to the dataset.
The most prominent result comes from an October 2025 study by Anthropic's Alignment Science team, the UK AI Security Institute's Safeguards team, and the Alan Turing Institute, described as the largest data-poisoning investigation published to date [1][20]. The researchers pretrained models from scratch at four sizes (600 million, 2 billion, 7 billion, and 13 billion parameters) and injected documents in which a trigger phrase, <SUDO>, is followed by gibberish text, so that a model encountering the trigger emits gibberish (a denial-of-service backdoor). They found that 250 poisoned documents, roughly 420,000 tokens or about 0.00016 percent of total training tokens for the largest model, reliably installed the backdoor across every model size, while 100 documents did not robustly succeed. As the authors put it, "poisoning attacks require a near-constant number of documents regardless of model size" [1].
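The reported percentage can be sanity-checked with a few lines of arithmetic. The calculation below assumes a training budget of about 20 tokens per parameter; that ratio is an assumption introduced here to reconstruct the total token count, not a figure quoted in the text above.

```python
poison_docs = 250
poison_tokens = 420_000            # total poisoned tokens reported for 250 documents
tokens_per_param = 20              # assumption: compute-optimal training budget

def poison_share_percent(num_params):
    """Poisoned tokens as a percentage of all training tokens."""
    total_tokens = num_params * tokens_per_param
    return 100 * poison_tokens / total_tokens

share_13b = poison_share_percent(13_000_000_000)    # about 0.00016 percent
share_600m = poison_share_percent(600_000_000)      # about 0.0035 percent
tokens_per_doc = poison_tokens / poison_docs         # 1680 tokens per document on average
```

Under that assumption the same 250 documents make up a share roughly 20 times larger in the smallest model than in the largest, which is why a constant document count is a different claim from a constant percentage.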
Anthropic noted the practical significance directly: "This finding challenges the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount" [1]. Because creating 250 documents is far easier than producing a fixed share of a billion-sample corpus, the result lowers the practical barrier to attack. The authors cautioned that the study was limited to small backdoors that produce gibberish and to models up to 13 billion parameters, and that it remains unknown whether the constant-count pattern holds for more harmful behaviors or frontier-scale models.
The same small-sample dynamic appears in image models. In the Nightshade work (see below), a single prompt-specific attack succeeds with about 50 to 100 optimized samples against models trained on billions of images [10].
Because many AI models are trained on data scraped from the open internet, any actor who can publish web content can potentially influence a training set. The landmark analysis of this risk is "Poisoning Web-Scale Training Datasets is Practical" by Nicholas Carlini, Florian Tramer, and colleagues, published at the 2024 IEEE Symposium on Security and Privacy [11].
The paper introduces two attacks that require no special access to the dataset curator. In split-view poisoning, an attacker exploits the fact that web content is mutable: the content that a dataset's maintainer saw when indexing a list of URLs can differ from the content a later downloader retrieves, for example because the attacker has bought expired domains that the dataset still points to. In frontrunning poisoning, the attacker targets datasets built from periodic snapshots of crowd-sourced content such as Wikipedia, making a malicious edit shortly before a snapshot is taken so that the snapshot captures it even if the edit is reverted soon afterward. The authors state that "our attacks are immediately practical and could, today, poison 10 popular datasets" [11].
The quantified result is striking: the team estimated it could poison 0.01 percent of the LAION-400M or COYO-700M datasets for about 60 US dollars by purchasing expired domains [11]. Combined with the Anthropic finding that small absolute counts suffice for a backdoor, the web-scale work shows that the cost and access barriers to poisoning real production datasets are low.
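To put the quoted figures in absolute terms, a short calculation (using only the numbers stated above, with LAION-400M taken as 400 million samples):

```python
dataset_size = 400_000_000                  # LAION-400M image-text pairs
poison_samples = dataset_size // 10_000     # 0.01 percent of the dataset
budget_usd = 60
cost_per_sample = budget_usd / poison_samples
```

This works out to 40,000 attacker-controlled samples at 0.15 US cents each.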
While data poisoning is typically discussed as a threat, the Nightshade tool, developed by Shawn Shan, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng, and Ben Y. Zhao at the University of Chicago, represents a notable inversion: using data poisoning as a defensive mechanism for content creators [10][12].
Nightshade is a prompt-specific poisoning attack designed for text-to-image generative models such as Stable Diffusion and DALL-E. The tool creates poisoned versions of images that look visually identical to the originals to human observers but contain optimized perturbations that corrupt the model's understanding of specific concepts when the images are used as training data.
The attack exploits the fact that text-to-image models learn associations between text prompts and visual features. By carefully perturbing images associated with a given concept (for example, "dog"), Nightshade can cause the model to learn incorrect associations, so that when prompted to generate a "dog," it produces something entirely different, such as a cat.
One of Nightshade's most notable properties is its efficiency. The researchers demonstrated that fewer than 100 poisoned samples could completely corrupt a specific prompt in Stable Diffusion XL (SDXL), the most advanced version of Stable Diffusion at the time of the research; a single attack mapping "car" to "cow," for instance, succeeded with about 50 optimized samples [10]. This is a remarkably small number considering that SDXL was trained on billions of images.
Nightshade also exhibits a "bleed-through" effect, where poisoning one concept affects related concepts. For example, poisoning the concept "dog" might also degrade the model's ability to generate "puppy" or "wolf." When approximately 250 independent Nightshade attacks target different prompts on a single model, general features in the model become corrupted and its image-generation function collapses, so that it can no longer produce meaningful images at all [10].
Nightshade is part of a broader suite of tools developed by the same research group. Its companion tool, Glaze, takes a complementary approach: rather than poisoning the model, Glaze adds perturbations to images that disrupt the model's ability to learn an artist's specific visual style [13]. Glaze "cloaks" artworks so that AI models incorrectly learn the features that define an artist's style, making it difficult for the model to replicate that style even if it trains on the cloaked images. In surveys, over 90 percent of professional artists indicated willingness to use Glaze when posting their work online, and the tool surpassed 6 million downloads after its March 2023 release [13][15].
The researchers explicitly frame Nightshade and Glaze as defensive tools for content creators whose work is scraped from the internet for AI training without consent or compensation. Many web scrapers ignore opt-out directives such as robots.txt, leaving creators with no effective technical means to prevent their work from being used as training data. Nightshade provides a form of technical enforcement: if a scraper ignores the creator's wishes and uses the poisoned images for training, the resulting model will be degraded. Ben Zhao, who led the team, told MIT Technology Review that the hope is for Nightshade to "tip the power balance back from AI companies towards artists" by deterring the disregard of copyright and intellectual property [10][14].
The Nightshade paper was accepted at the IEEE Symposium on Security and Privacy, and Shawn Shan was named MIT Technology Review's Innovator of the Year in 2024 for this work [16].
Defending against data poisoning is challenging because the attacker operates during the training phase, before the model is deployed, and sophisticated attacks produce poisoned samples that are difficult to distinguish from legitimate data.
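Data sanitization is the most intuitive defense: inspect the training data and remove samples that appear anomalous or suspicious before training the model. Common sanitization techniques include outlier detection based on nearest neighbors, on training loss, and on singular value decomposition (SVD).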
However, research by Koh, Steinhardt, and Liang (2022) demonstrated that stronger poisoning attacks can bypass a broad range of common data sanitization defenses, including those based on nearest neighbors, training loss, and SVD [17]. This finding suggests that data sanitization alone is insufficient against sophisticated adversaries.
Certified defenses aim to provide provable guarantees that the model's predictions will not change by more than a specified amount under any poisoning attack of a given size. Steinhardt, Koh, and Liang introduced the concept of certified defenses for data poisoning in 2017 [18]. Their approach constructs approximate upper bounds on the loss that any attacker can achieve, for defenders that first perform outlier removal followed by empirical risk minimization.
More recent certified defenses include Finite Aggregation, proposed by Wang et al. in 2022 [19]. It builds on earlier partition-based ensembles, which split the training set into disjoint subsets, train a separate classifier on each, and combine the predictions by voting. Finite Aggregation first splits the training set into many small disjoint subsets and then combines several of them into larger, overlapping subsets that are used to train the base classifiers. Because each poisoned sample can influence only a bounded number of base classifiers, the ensemble's vote on a given test input provably cannot change unless the number of poisoned training points exceeds a certified threshold.
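The partition-and-vote idea can be sketched as follows. This is a simplified disjoint-partition scheme, closer to Deep Partition Aggregation than to the full Finite Aggregation construction, and it swaps in a toy nearest-centroid base learner; the function names, the CRC32 hash, and the data are assumptions made for illustration.

```python
import zlib
import numpy as np

def partition_ids(X, k):
    """Assign each sample to one of k partitions from a hash of its own bytes."""
    return np.array([zlib.crc32(np.ascontiguousarray(x).tobytes()) % k for x in X])

def train_centroids(X, y, num_classes):
    """Toy base learner: one mean vector per class (NaN row if a class is absent)."""
    dim = X.shape[1]
    return np.array([X[y == c].mean(axis=0) if np.any(y == c) else np.full(dim, np.nan)
                     for c in range(num_classes)])

def predict_with_certificate(models, x, num_classes):
    votes = np.zeros(num_classes, dtype=int)
    for centroids in models:
        votes[np.nanargmin(np.linalg.norm(centroids - x, axis=1))] += 1
    top = int(np.argmax(votes))            # argmax breaks ties toward the lower index
    classes = np.arange(num_classes)
    # A rival with a lower index wins ties, so it needs one vote fewer to overtake.
    margins = votes[top] - votes - (classes < top).astype(int)
    # Each inserted or removed sample touches one partition, so it can move at
    # most one vote from the winner to a rival, closing the margin by two.
    return top, int(margins[classes != top].min() // 2)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(150, 2)), rng.normal(2.0, 1.0, size=(150, 2))])
y = np.array([0] * 150 + [1] * 150)
k = 15
ids = partition_ids(X, k)
models = [train_centroids(X[ids == j], y[ids == j], 2) for j in range(k) if np.any(ids == j)]
pred, certified = predict_with_certificate(models, np.array([2.0, 2.0]), 2)
```

Hashing the sample contents, rather than using its position in the dataset, matters for the guarantee: it ensures that adding or removing one sample cannot reshuffle which partition the other samples fall into.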
STRIP (STRong Intentional Perturbation) is a runtime defense against backdoor attacks [21]. It works by superimposing random images onto an incoming input and observing the entropy of the model's predictions. For clean inputs, the superimposed perturbations cause the predictions to vary widely (high entropy). For backdoor-triggered inputs, the trigger dominates the model's prediction, causing low entropy regardless of the superimposed content. STRIP can detect backdoor attacks without requiring access to the training data or knowledge of the trigger pattern.
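The entropy test can be illustrated with a toy example. Everything below is a stand-in chosen for the sketch: `toy_backdoored_model` is a hypothetical linear classifier with a hard-wired corner-patch backdoor, and additive superimposition with clipping is one simple way to blend images, not necessarily the exact blending used in the STRIP paper.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, SIDE = 4, 8
W = rng.normal(size=(NUM_CLASSES, SIDE * SIDE)) * 0.1   # stand-in for trained weights

def toy_backdoored_model(img):
    """Hypothetical classifier: linear softmax plus a hard-wired patch backdoor."""
    logits = W @ img.ravel()
    if img[-2:, -2:].min() > 0.9:       # trigger: bright 2x2 patch in the corner
        logits[0] += 50.0               # backdoor forces class 0
    z = np.exp(logits - logits.max())
    return z / z.sum()

def strip_entropy(model, x, overlays):
    """Mean Shannon entropy of predictions on x superimposed with each overlay."""
    entropies = []
    for o in overlays:
        p = model(np.clip(x + o, 0.0, 1.0))
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return float(np.mean(entropies))

overlays = rng.random((20, SIDE, SIDE)) * 0.4    # held-out clean images
clean = rng.random((SIDE, SIDE)) * 0.4
triggered = clean.copy()
triggered[-2:, -2:] = 1.0                        # stamp the trigger

h_clean = strip_entropy(toy_backdoored_model, clean, overlays)
h_triggered = strip_entropy(toy_backdoored_model, triggered, overlays)
```

In deployment, a defender would calibrate an entropy threshold on known-clean inputs and flag any incoming input whose score falls below it.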
Spectral signatures are another defense approach, based on the observation that backdoor-poisoned samples leave detectable traces in the spectrum of the learned representations. By analyzing the covariance matrix of feature representations and identifying outlier directions, defenders can detect and remove poisoned samples [22].
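A minimal sketch of the scoring step, assuming the feature representations for one class have already been extracted into a matrix (the synthetic features, the size of the shift, and the number of samples removed are illustrative assumptions):

```python
import numpy as np

def spectral_scores(features):
    """Squared projection of each centered feature vector onto the top singular direction."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

rng = np.random.default_rng(0)
clean_feats = rng.normal(size=(190, 16))
# Simulated poisoned samples: the backdoor shifts their features along one direction.
poison_feats = rng.normal(size=(10, 16)) + 10.0 * np.eye(16)[0]
feats = np.vstack([clean_feats, poison_feats])

scores = spectral_scores(feats)
suspects = np.argsort(scores)[-10:]    # highest-scoring samples, candidates for removal
```

In practice the defender does not know how many samples are poisoned, so the number removed is set from an assumed upper bound on the poisoning rate, and the model is retrained on what remains.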
Training models with differential privacy provides a form of inherent robustness to data poisoning, because differential privacy limits the influence that any single training sample can have on the model's parameters. However, the privacy budgets required for meaningful robustness typically result in significant accuracy loss, making this approach impractical for many applications.
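The mechanism that bounds a single sample's influence is per-sample gradient clipping followed by added noise, as in DP-SGD-style training. The sketch below shows one such aggregation step only; it omits privacy accounting entirely, and the function name and parameter values are assumptions for illustration.

```python
import numpy as np

def private_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient to `clip_norm`, sum, add Gaussian noise, then average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 5))
grads[0] *= 1000.0                      # one poisoned sample with an enormous gradient
plain_mean = grads.mean(axis=0)         # dominated by the poisoned sample
private_mean = private_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

After clipping, no single sample can move the averaged gradient by more than `clip_norm / n`, regardless of how extreme it is. The accuracy cost mentioned above comes from the same two ingredients: clipping biases the gradient and the noise adds variance.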
Because sanitization and certified defenses each have limits, a growing line of work emphasizes data provenance: tracking where each training sample came from, cryptographically verifying that downloaded content matches what a dataset curator indexed, and maintaining content-integrity checks (for example, publishing a cryptographic hash of the content behind each dataset URL at curation time). Carlini et al. proposed such integrity checks as a low-cost mitigation for split-view poisoning, alongside timing-based measures against frontrunning [11]. Anomaly detection on training data and on intermediate model representations complements provenance by flagging samples or activations that deviate from expected distributions before or during training.
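A content-integrity check of this kind can be sketched with the standard library alone. The manifest layout (a mapping from URL to SHA-256 digest), the helper names, and the example URL are assumptions made for illustration, not a format used by any particular dataset.

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def build_manifest(samples: dict) -> dict:
    """Curator side: record a digest of the content seen at each URL."""
    return {url: sha256_hex(content) for url, content in samples.items()}

def verify_download(manifest: dict, url: str, content: bytes) -> bool:
    """Downloader side: accept content only if it matches the recorded digest."""
    return manifest.get(url) == sha256_hex(content)

curated = {"https://example.com/cat.jpg": b"original image bytes"}
manifest = build_manifest(curated)

ok = verify_download(manifest, "https://example.com/cat.jpg", b"original image bytes")
tampered = verify_download(manifest, "https://example.com/cat.jpg", b"swapped after domain expired")
```

The trade-off is that exact hashes also reject benign changes such as re-encoded or resized images, so strict verification shrinks the usable dataset over time.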
Organizations increasingly use adversarial testing and red teaming to evaluate their models' vulnerability to data poisoning. By deliberately simulating poisoning attacks during development, deploying planted triggers, and testing the model's response, defenders can identify vulnerabilities before deployment and develop targeted mitigations [23].
| Defense Strategy | Type | What It Protects Against | Limitations |
|---|---|---|---|
| Data sanitization (outlier removal) | Preventive | General poisoning, some backdoors | Bypassed by sophisticated attacks |
| Data provenance / integrity checks | Preventive | Web-scale (split-view, frontrunning) | Requires curator and downloader cooperation |
| Certified defenses (Finite Aggregation) | Provable | Bounded-size poisoning attacks | Computational overhead; limited certified radius |
| STRIP | Runtime detection | Backdoor attacks | Only detects trigger-based attacks |
| Spectral signatures | Detection | Backdoor attacks | Requires access to feature representations |
| Differential privacy | Inherent robustness | All poisoning types | Significant accuracy loss |
| Red teaming | Evaluation | All attack types | Does not guarantee coverage |
Data poisoning is not merely a theoretical concern. Several real-world incidents and demonstrations have illustrated the practical risks.
Because many AI models are trained on data scraped from the open internet, any actor who can publish content on the web can potentially influence training data. As detailed above, Carlini et al. demonstrated that injecting poisoned samples into popular web-scraped datasets is feasible and cheap, estimating that 0.01 percent of LAION-400M or COYO-700M could be poisoned for roughly 60 US dollars [11]. The scale of modern training datasets (often containing billions of samples) makes it impractical to manually verify every sample, creating a large attack surface.
The Basilisk Venom attack demonstrated that hidden prompts embedded in GitHub code comments could poison fine-tuned code generation models, creating persistent backdoors that cause the model to generate insecure or malicious code under specific conditions [23]. This is particularly concerning because AI code generation tools are widely used by developers, and a poisoned code model could propagate vulnerabilities across many software projects.
Studies have shown that replacing as little as 0.001 percent of training tokens with misinformation can increase harmful outputs from medical language models by 7 to 11 percent [23]. In healthcare settings, where AI-assisted diagnosis and treatment recommendations can directly affect patient outcomes, even small increases in error rates can have serious consequences.
Federated learning, where multiple participants collaboratively train a model without sharing their raw data, is particularly vulnerable to data poisoning. A malicious participant can send poisoned model updates that inject backdoors or degrade the shared model's performance. Because the server cannot inspect individual participants' data, detecting and mitigating poisoning in federated learning is especially challenging [24].
With the rise of large language models, new attack vectors have emerged. Poisoning attacks can target the fine-tuning stage, the reinforcement learning from human feedback (RLHF) preference data, the retrieval-augmented generation (RAG) pipeline, or even the tools and plugins that LLM-based agents use. Because RLHF and instruction fine-tuning use comparatively small, human-curated datasets, a handful of poisoned preference pairs or fine-tuning examples can have outsized influence on model behavior. The MCP (Model Context Protocol) tool poisoning attack demonstrated that invisible instructions embedded in tool descriptions could silently redirect LLM agent behavior, achieving 72 percent success rates in some configurations [23].
Data poisoning intersects with several other AI security concerns.
Model collapse occurs when generative AI models are recursively trained on their own outputs, leading to progressive degradation and loss of distributional diversity. While model collapse is not an intentional attack, it shares the same underlying mechanism as data poisoning: corrupted training data leading to degraded model behavior. In some sense, model collapse can be viewed as an unintentional, large-scale form of data availability poisoning.
Prompt injection targets AI systems at inference time by embedding malicious instructions in the input data that the model processes. Data poisoning and prompt injection can be used in combination: a poisoned model may be more susceptible to prompt injection attacks, and prompt injection can be used to exfiltrate information about a model's training data that could inform a subsequent poisoning attack.
Data poisoning can be part of a broader AI supply chain attack, where an adversary compromises a component of the AI development pipeline (training data, pretrained model weights, fine-tuning datasets, or evaluation benchmarks) to influence the behavior of downstream models. The BadNets authors originally framed backdoors precisely as a supply-chain problem, and the increasing use of pretrained models, shared datasets, and third-party data providers creates multiple points of vulnerability [5][25].
As of early 2026, data poisoning remains an active and growing area of concern in AI security. Several trends define the current landscape.
The scale of modern training datasets continues to increase, making manual verification of training samples infeasible and expanding the attack surface for poisoning. At the same time, the diversity of attack methods has grown substantially. A 2025 survey identified six major categories of poisoning algorithms (heuristic-based, label flipping, feature collision, bilevel optimization, influence-based, and generative model-based), each with multiple variants [7].
The October 2025 Anthropic, UK AI Security Institute, and Alan Turing Institute study reframed the field's mental model by showing that backdooring a model takes a small, roughly fixed number of poisoned documents (about 250) across a roughly 20-fold range of model sizes, rather than a fixed percentage of the dataset [1][20]. The finding intensified industry focus on pretraining-data integrity, since it implies that simply scaling models and datasets does not dilute a poisoning attack.
On the defense side, certified defenses have improved but remain limited in scale and practical applicability. Data sanitization techniques have been shown to be insufficient against adaptive adversaries, driving research toward more robust approaches such as ensemble methods, spectral analysis, data provenance tracking, and training-time monitoring. The PoisonBench benchmarking framework, introduced in 2025, provides standardized evaluations of poisoning attacks and defenses, helping the research community measure progress more reliably [23].
The defensive use of data poisoning, exemplified by Nightshade and Glaze, has gained significant traction among artists and content creators who view it as one of the few effective technical tools for protecting their work from unauthorized use in AI training. This has created a novel dynamic where data poisoning is simultaneously a threat to AI systems and a tool for protecting intellectual property.
Industry responses include increased investment in data provenance tracking, supply chain security for AI training pipelines, and red teaming programs that specifically test for poisoning vulnerabilities. Regulatory frameworks, including the EU AI Act, are beginning to address data integrity requirements for AI training, though specific technical standards for poisoning resistance are still under development.
The fundamental challenge of data poisoning is that machine learning models inherently trust their training data. Until this assumption can be relaxed through robust verification mechanisms, certified defenses, or fundamentally new training paradigms, data poisoning will remain a significant threat to the reliability and safety of AI systems.