Data poisoning is a class of adversarial attack in which a malicious actor deliberately corrupts the training data used to build machine learning models, with the goal of manipulating the model's behavior at inference time. Unlike evasion attacks, which perturb inputs when a deployed model is making predictions, data poisoning targets the training process itself. By introducing carefully crafted malicious samples into the training set, an attacker can cause the resulting model to misclassify specific inputs, degrade in overall accuracy, or respond to hidden triggers in predictable ways. Data poisoning has emerged as one of the most serious threats in AI safety and security, particularly as modern AI systems increasingly rely on large, web-scraped datasets where the provenance and integrity of individual training samples are difficult to verify [1].
The vulnerability of machine learning models to data poisoning stems from a fundamental assumption underlying most training procedures: that the training data is representative of the true data distribution and has not been tampered with. In practice, this assumption is frequently violated. Large language models and image generation systems are trained on datasets containing billions of samples scraped from the open internet, where anyone can publish content. Federated learning systems aggregate updates from potentially untrusted participants. Even curated datasets can be compromised if an attacker gains access to the data pipeline.
The threat model for data poisoning assumes that the attacker has the ability to insert, modify, or relabel a fraction of the training data, but does not have direct access to the model's parameters or training procedure. The attacker's leverage comes from the fact that deep learning models are highly sensitive to their training data and will faithfully learn patterns present in that data, including patterns that have been deliberately introduced by an adversary.
Research on data poisoning has accelerated significantly since 2020, driven by the growing scale of AI training datasets and the increasing deployment of AI systems in security-critical applications. A 2025 systematic review covering the period from 2018 to 2025 catalogued hundreds of attack methods and defense strategies, reflecting the rapid growth of this field [2].
Data poisoning attacks can be categorized along several dimensions: the attacker's goal, the type of modification made to the training data, and whether the attack is detectable through standard inspection.
Availability attacks, also called indiscriminate poisoning attacks, aim to degrade the overall performance of the trained model. The attacker's goal is to make the model less accurate across all inputs, effectively rendering it unreliable or unusable. These attacks do not target specific inputs or classes; instead, they corrupt the model's general ability to learn.
A simple example is random label flipping, where the attacker changes the labels of a random subset of training samples to incorrect values. If enough labels are flipped, the model learns a distorted decision boundary and its accuracy degrades across the input space. More sophisticated availability attacks use optimization techniques to identify the most damaging perturbations with the fewest poisoned samples.
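Random label flipping is simple enough to sketch directly. The function below is illustrative (the name and parameters are not from any particular library): it reassigns a chosen fraction of labels to a uniformly random incorrect class, with the fraction playing the role of the attacker's poisoning budget.

```python
import numpy as np

def flip_labels(labels, flip_fraction, num_classes, rng=None):
    """Randomly reassign a fraction of labels to a different class.

    A minimal sketch of indiscriminate label-flipping poisoning;
    `flip_fraction` is the attacker's poisoning budget.
    """
    rng = rng or np.random.default_rng(0)
    labels = labels.copy()  # leave the caller's labels untouched
    n_flip = int(len(labels) * flip_fraction)
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in idx:
        # Draw uniformly from the classes other than the current one,
        # so every flipped label is guaranteed to be wrong.
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels
```

Note that the flipped labels are chosen at random here; as discussed below, targeted variants that concentrate flips near decision boundaries are far more damaging per flipped label.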
Targeted attacks aim to cause the model to misclassify specific inputs at test time, while maintaining normal performance on most other inputs. The attacker wants the model to work correctly in general (so that the attack is not detected through routine performance monitoring) but to fail in a precise, attacker-controlled way on chosen inputs.
For example, an attacker might want a facial recognition system to misidentify a specific person, or a spam filter to allow specific malicious emails through. Targeted attacks are generally harder to execute than availability attacks because the attacker must achieve a precise effect without degrading overall model performance.
Backdoor attacks, also known as trojan attacks, are a particularly insidious form of data poisoning. The attacker embeds a hidden trigger pattern into a subset of training samples and associates those samples with a target label chosen by the attacker. The resulting model behaves normally on clean inputs but produces the attacker-chosen output whenever the trigger is present in the input [3].
The trigger can take many forms depending on the data modality: a small pixel patch or watermark in images, a rare token sequence or phrase in text, or a specific tone or background sound in audio.
Backdoor attacks are dangerous because the model appears to function perfectly during standard evaluation. The backdoor is only activated when the attacker deliberately presents an input containing the trigger. This makes backdoor attacks extremely difficult to detect through conventional model testing.
The BadNets attack, introduced by Gu et al. in 2017, was one of the first formalized backdoor attacks on neural networks [4]. The authors demonstrated that a model could be trained to recognize a small pixel pattern as a backdoor trigger, correctly classifying all clean inputs while misclassifying any input containing the trigger pattern. Subsequent work has produced increasingly stealthy and effective backdoor attacks.
| Attack Type | Goal | Effect on Clean Inputs | Detection Difficulty | Example |
|---|---|---|---|---|
| Availability (indiscriminate) | Degrade overall accuracy | Performance drops on all inputs | Moderate (visible in metrics) | Random label flipping |
| Targeted | Misclassify specific inputs | Normal performance maintained | High (metrics look normal) | Cause specific face to be misidentified |
| Backdoor/Trojan | Activate hidden behavior on trigger | Normal performance maintained | Very high (invisible in standard tests) | Pixel patch triggers misclassification |
| Clean-label | Targeted attack without label changes | Normal performance maintained | Very high (labels are correct) | Feature collision attack |
The methods used to execute data poisoning attacks range from simple heuristics to sophisticated optimization procedures.
Label flipping is the simplest form of data poisoning. The attacker changes the labels of selected training samples from their correct class to an incorrect class, without modifying the input data itself. While random label flipping can degrade model accuracy, targeted label flipping strategies are far more effective. Research has shown that strategically selecting which samples to flip, using techniques such as clustering to identify the most influential samples near decision boundaries, can cause disproportionate damage with a small number of flipped labels [5].
The effectiveness of label flipping depends on the fraction of labels that are flipped and the model's capacity to tolerate noisy labels. Modern deep learning models have some inherent robustness to random label noise, but targeted label flipping can overcome this resilience by concentrating the flipped labels in regions of the input space that are most consequential for the model's decision boundary.
Clean-label attacks represent a more sophisticated and harder-to-detect form of data poisoning. In a clean-label attack, the attacker modifies the input features of training samples while leaving their labels unchanged. Because the labels are correct, the poisoned samples pass visual inspection and standard data validation checks.
The "Poison Frogs" attack, introduced by Shafahi et al. at NeurIPS 2018, was a foundational clean-label poisoning method [6]. The attack works by creating poisoned samples that are close to a target sample in the model's feature space but belong to a different class in the input space. When the model trains on these samples, it learns a feature representation that associates the target's features with the wrong class, causing misclassification at test time.
Feature collision is the underlying mechanism of many clean-label attacks. The attacker crafts a poisoned sample x_p from a base class by optimizing it so that its representation in the model's feature space is close to a target sample x_t from a different class, while x_p still looks like a natural sample from its original class to a human observer. The result is that the model's learned decision boundary shifts to accommodate the poisoned sample, causing the target to be misclassified.
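The feature-collision objective can be illustrated with a toy sketch in which a fixed linear map W stands in for the victim network's penultimate-layer features (a real attack would backpropagate through the network itself; all names here are illustrative). The poison starts at the base sample and descends the combined objective: match the target in feature space while staying close to the base in input space.

```python
import numpy as np

def feature_collision(W, x_target, x_base, beta=0.1, lr=0.01, steps=500):
    """Craft a clean-label poison: stay close to the base sample in input
    space while colliding with the target in feature space.

    Minimizes  ||W x - W x_target||^2 + beta * ||x - x_base||^2
    by gradient descent, with W a stand-in linear feature extractor.
    """
    x = x_base.copy()
    for _ in range(steps):
        # Gradient of the feature-matching term plus the proximity term.
        grad = 2 * W.T @ (W @ x - W @ x_target) + 2 * beta * (x - x_base)
        x -= lr * grad
    return x
```

The hyperparameter `beta` trades off stealth (proximity to the base sample, which carries the correct label) against attack strength (proximity to the target in feature space).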
More powerful poisoning attacks use bilevel optimization, where the outer optimization selects the poisoned training points to maximize the attacker's objective, and the inner optimization simulates the model's training process on the poisoned dataset. This formulation allows the attacker to anticipate how the model will respond to the poisoned data and choose the most effective poison samples accordingly.
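In symbols (with D_p the set of poisoned points, A the attacker's objective, and L the training loss), the bilevel formulation can be written as:

```latex
\max_{D_p} \; \mathcal{A}\bigl(\theta^{*}(D_p)\bigr)
\quad \text{subject to} \quad
\theta^{*}(D_p) = \arg\min_{\theta} \; \mathcal{L}\bigl(\theta;\; D_{\mathrm{clean}} \cup D_p\bigr)
```

The outer maximization chooses the poison set; the inner minimization simulates training on the corrupted dataset, which is why solving the problem exactly requires repeated (or approximated) retraining.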
Bilevel optimization attacks can be applied to both label-flipping and clean-label settings. They tend to produce more effective attacks than heuristic methods but are computationally expensive, as they require solving a nested optimization problem that involves repeatedly training the target model.
Influence functions provide a way to estimate the effect of individual training samples on the model's predictions without retraining the model from scratch. Attackers can use influence functions to identify which training samples, if modified or added, would have the greatest impact on the model's behavior for specific test inputs. This approach provides a computationally efficient alternative to bilevel optimization for crafting targeted poisoning attacks [7].
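As an illustration, the influence score has a closed form for ridge regression. The sketch below is a toy stand-in for the Koh & Liang estimator (function and variable names are illustrative): it scores every training point against a single test point as -grad_test^T H^{-1} grad_i, where H is the Hessian of the regularized training loss.

```python
import numpy as np

def influence_scores(X, y, x_test, y_test, reg=1e-2):
    """First-order influence of each training point on a test point's
    squared loss, for ridge regression (toy stand-in for the deep-network
    estimator, where H would be the loss Hessian at the trained weights).
    """
    n, d = X.shape
    H = X.T @ X / n + reg * np.eye(d)            # Hessian of regularized loss
    theta = np.linalg.solve(H, X.T @ y / n)      # closed-form ridge solution
    g_test = (x_test @ theta - y_test) * x_test  # gradient of test loss
    H_inv_g = np.linalg.solve(H, g_test)         # H^{-1} grad_test
    residuals = X @ theta - y
    # Score per training point i: -grad_test^T H^{-1} grad_i,
    # where grad_i = residual_i * x_i.
    return -(X @ H_inv_g) * residuals
```

An attacker would rank training points by the magnitude of these scores and perturb (or imitate) the most influential ones; a defender can use the same ranking for inspection.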
Recent work has explored using generative models to create poisoned training samples. Rather than optimizing perturbations for individual samples, the attacker trains a generative model to produce poisoned samples that are both effective (in terms of poisoning impact) and natural-looking (in terms of evading detection). This approach scales better than optimization-based methods and can produce large numbers of poisoned samples with diverse appearances.
| Attack Method | Modifies Labels? | Modifies Features? | Detection Difficulty | Computational Cost |
|---|---|---|---|---|
| Label flipping | Yes | No | Low to moderate | Low |
| Clean-label (feature collision) | No | Yes (subtle) | High | Moderate |
| Bilevel optimization | Optional | Yes | High | High |
| Influence function-based | Optional | Yes | High | Moderate |
| Generative model-based | Optional | Yes | Very high | Moderate to high |
While data poisoning is typically discussed as a threat, the Nightshade tool, developed by Shawn Shan, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng, and Ben Zhao at the University of Chicago, represents a notable inversion: using data poisoning as a defensive mechanism for content creators [8].
Nightshade is a prompt-specific poisoning attack designed for text-to-image generative models such as Stable Diffusion and DALL-E. The tool creates poisoned versions of images that look visually identical to the originals to human observers but contain optimized perturbations that corrupt the model's understanding of specific concepts when the images are used as training data.
The attack exploits the fact that text-to-image models learn associations between text prompts and visual features. By carefully perturbing images associated with a given concept (for example, "dog"), Nightshade can cause the model to learn incorrect associations, so that when prompted to generate a "dog," it produces something entirely different, such as a cat.
One of Nightshade's most notable properties is its efficiency. The researchers demonstrated that fewer than 100 poisoned samples could completely corrupt a specific prompt in Stable Diffusion XL (SDXL), the most advanced version of Stable Diffusion at the time of the research [8]. This is a remarkably small number considering that SDXL was trained on billions of images.
Nightshade also exhibits a "bleed-through" effect, where poisoning one concept affects related concepts. For example, poisoning the concept "dog" might also degrade the model's ability to generate "puppy" or "wolf." When approximately 250 independent Nightshade attacks target different prompts on a single model, the model's understanding of basic visual features can become so corrupted that it is no longer able to generate meaningful images at all.
Nightshade is part of a broader suite of tools developed by the same research group. Its companion tool, Glaze, takes a complementary approach: rather than poisoning the model, Glaze adds perturbations to images that disrupt the model's ability to learn an artist's specific visual style [9]. Glaze "cloaks" artworks so that AI models incorrectly learn the features that define an artist's style, making it difficult for the model to replicate that style even if it trains on the cloaked images. In surveys, over 90% of professional artists indicated willingness to use Glaze when posting their work online.
The researchers explicitly frame Nightshade and Glaze as defensive tools for content creators whose work is scraped from the internet for AI training without consent or compensation. Many web scrapers ignore opt-out directives such as robots.txt, leaving creators with no effective technical means to prevent their work from being used as training data. Nightshade provides a form of technical enforcement: if a scraper ignores the creator's wishes and uses the poisoned images for training, the resulting model will be degraded [8].
The Nightshade paper was accepted at the IEEE Symposium on Security and Privacy, and Shawn Shan was named MIT Technology Review's Innovator of the Year in 2024 for this work [10].
Defending against data poisoning is challenging because the attacker operates during the training phase, before the model is deployed, and sophisticated attacks produce poisoned samples that are difficult to distinguish from legitimate data.
Data sanitization is the most intuitive defense: inspect the training data and remove samples that appear anomalous or suspicious before training the model. Common sanitization techniques include outlier detection in input or feature space, nearest-neighbor distance filtering, removal of samples with anomalously high training loss, and spectral methods based on singular value decomposition (SVD).
However, research by Koh, Steinhardt, and Liang demonstrated that stronger poisoning attacks can bypass a broad range of common data sanitization defenses, including those based on nearest neighbors, training loss, and SVD [11]. This finding suggests that data sanitization alone is insufficient against sophisticated adversaries.
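For concreteness, the kind of nearest-neighbor filter that stronger attacks are designed to evade can be sketched as follows (the thresholding choices are illustrative, not from any particular defense implementation):

```python
import numpy as np

def knn_outlier_filter(X, k=5, quantile=0.95):
    """Flag samples whose mean distance to their k nearest neighbors is
    unusually large -- a minimal sanitization sketch.

    Returns a boolean mask of samples to KEEP.
    """
    # Pairwise Euclidean distances (O(n^2) memory; fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)  # ignore self-distance
    # Mean distance to the k nearest neighbors of each point.
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    threshold = np.quantile(knn_mean, quantile)
    return knn_mean <= threshold
```

A filter like this removes isolated points, which is precisely why adaptive attacks place poisons inside dense regions of the clean data distribution, where the filter cannot distinguish them.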
Certified defenses aim to provide provable guarantees that the model's predictions will not change by more than a specified amount under any poisoning attack of a given size. Steinhardt, Koh, and Liang introduced the concept of certified defenses for data poisoning in 2017 [12]. Their approach constructs approximate upper bounds on the loss that any attacker can achieve, for defenders that first perform outlier removal followed by empirical risk minimization.
More recent certified defenses include Finite Aggregation, proposed by Wang et al. in 2022 [13]. This method refines earlier partition-based ensembles: the training set is split into small disjoint buckets, separate classifiers are trained on groups of buckets, and their predictions are combined through a voting mechanism. Because each poisoned sample can affect only a bounded number of these classifiers, the overall ensemble is robust to a bounded number of poisoned samples. The method provides provable guarantees on the maximum number of test samples whose predictions can be affected by a given number of poisoned training points.
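A simplified sketch of the partition-and-vote idea follows (closer to plain disjoint partitioning, as in the earlier Deep Partition Aggregation scheme, than to Finite Aggregation's grouped buckets; all names are illustrative). Each training sample lands in exactly one partition, so k poisoned samples can corrupt at most k of the ensemble's votes.

```python
import numpy as np

def train_partition_ensemble(X, y, n_partitions, fit_fn):
    """Train one model per disjoint, deterministically assigned partition.

    `fit_fn(X_part, y_part)` must return a callable model. Because each
    poisoned sample falls in exactly one partition, k poisons corrupt at
    most k of the n_partitions votes -- the source of the certified bound.
    """
    parts = np.arange(len(X)) % n_partitions  # deterministic assignment
    return [fit_fn(X[parts == p], y[parts == p])
            for p in range(n_partitions)]

def predict_by_vote(models, x, n_classes):
    """Majority vote over the partition models' predictions."""
    votes = np.bincount([m(x) for m in models], minlength=n_classes)
    return int(votes.argmax())
```

With, say, three partitions, an attacker who fully controls one partition still loses the vote two to one on inputs the clean partitions classify correctly.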
STRIP (STRong Intentional Perturbation) is a runtime defense against backdoor attacks [14]. It works by superimposing random images onto an incoming input and observing the entropy of the model's predictions. For clean inputs, the superimposed perturbations cause the predictions to vary widely (high entropy). For backdoor-triggered inputs, the trigger dominates the model's prediction, causing low entropy regardless of the superimposed content. STRIP can detect backdoor attacks without requiring access to the training data or knowledge of the trigger pattern.
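The STRIP scoring procedure can be sketched as follows, assuming `model` maps a batch of inputs to class probabilities (the function name and parameters are illustrative, not from the authors' implementation):

```python
import numpy as np

def strip_entropy(model, x, overlay_pool, n_overlays=20, alpha=0.5, rng=None):
    """STRIP-style detection score: mean prediction entropy of an input
    blended with random clean samples.

    Low entropy suggests a backdoor trigger that survives blending and
    keeps forcing the same class; clean inputs yield high entropy because
    blending scrambles their predictions.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(overlay_pool), size=n_overlays, replace=False)
    blended = alpha * x[None, :] + (1 - alpha) * overlay_pool[idx]
    probs = np.clip(model(blended), 1e-12, 1.0)
    entropies = -(probs * np.log(probs)).sum(axis=1)
    return entropies.mean()
```

In deployment, inputs whose score falls below a calibrated threshold would be flagged as likely trigger-carrying and rejected or quarantined.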
Spectral signatures are another defense approach, based on the observation that backdoor-poisoned samples leave detectable traces in the spectrum of the learned representations. By analyzing the covariance matrix of feature representations and identifying outlier directions, defenders can detect and remove poisoned samples [15].
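A minimal version of the spectral-signature score, following the centered-SVD construction (names are illustrative): project each sample's centered feature vector onto the top singular direction and use the squared projection as an outlier score.

```python
import numpy as np

def spectral_scores(features):
    """Per-sample outlier score from the top singular direction of the
    centered feature matrix (a sketch of the spectral-signature defense).

    Backdoor-poisoned samples within a class tend to align with this
    direction, so the largest scores mark candidates for removal.
    """
    centered = features - features.mean(axis=0)
    # Top right-singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    return (centered @ top_direction) ** 2
```

In the full defense this scoring is applied per class to the model's learned representations, and the highest-scoring fraction of samples is removed before retraining.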
Training models with differential privacy provides a form of inherent robustness to data poisoning, because differential privacy limits the influence that any single training sample can have on the model's parameters. However, the privacy budgets required for meaningful robustness typically result in significant accuracy loss, making this approach impractical for many applications.
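The mechanism can be illustrated with a single DP-SGD-style update (per-example gradient clipping followed by Gaussian noise; parameter names are illustrative). Clipping is what bounds any one sample's influence on the parameters, and hence what buys poisoning robustness.

```python
import numpy as np

def dp_sgd_update(theta, per_example_grads, clip_norm, noise_mult, lr,
                  rng=None):
    """One DP-SGD-style step: clip each example's gradient to `clip_norm`,
    average, add Gaussian noise scaled by `noise_mult`, then descend.

    The clip bounds any single (possibly poisoned) sample's contribution;
    the noise provides the formal privacy guarantee.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Rescale only if the gradient exceeds the clipping norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_example_grads),
                       size=theta.shape)
    return theta - lr * (mean_grad + noise)
```

A poisoned example with an enormous gradient is reduced to at most `clip_norm` worth of influence per step, which is why the accuracy cost of aggressive clipping and noise is the price of this robustness.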
Organizations increasingly use adversarial testing and red teaming to evaluate their models' vulnerability to data poisoning. By deliberately simulating poisoning attacks during development, deploying planted triggers, and testing the model's response, defenders can identify vulnerabilities before deployment and develop targeted mitigations [16].
| Defense Strategy | Type | What It Protects Against | Limitations |
|---|---|---|---|
| Data sanitization (outlier removal) | Preventive | General poisoning, some backdoors | Bypassed by sophisticated attacks |
| Certified defenses (Finite Aggregation) | Provable | Bounded-size poisoning attacks | Computational overhead; limited certified radius |
| STRIP | Runtime detection | Backdoor attacks | Only detects trigger-based attacks |
| Spectral signatures | Detection | Backdoor attacks | Requires access to feature representations |
| Differential privacy | Inherent robustness | All poisoning types | Significant accuracy loss |
| Red teaming | Evaluation | All attack types | Does not guarantee coverage |
Data poisoning is not merely a theoretical concern. Several real-world incidents and demonstrations have illustrated the practical risks.
Because many AI models are trained on data scraped from the open internet, any actor who can publish content on the web can potentially influence training data. Researchers have demonstrated that it is feasible to inject poisoned samples into popular web-scraped datasets by publishing carefully crafted content on publicly accessible websites and waiting for web crawlers to collect it [17]. The scale of modern training datasets (often containing billions of samples) makes it impractical to manually verify every sample, creating a large attack surface.
The Basilisk Venom attack demonstrated that hidden prompts embedded in GitHub code comments could poison fine-tuned code generation models, creating persistent backdoors that cause the model to generate insecure or malicious code under specific conditions [16]. This is particularly concerning because AI code generation tools are widely used by developers, and a poisoned code model could propagate vulnerabilities across many software projects.
Studies have shown that replacing as little as 0.001% of training tokens with misinformation can increase harmful outputs from medical language models by 7 to 11 percent [16]. In healthcare settings, where AI-assisted diagnosis and treatment recommendations can directly affect patient outcomes, even small increases in error rates can have serious consequences.
Federated learning, where multiple participants collaboratively train a model without sharing their raw data, is particularly vulnerable to data poisoning. A malicious participant can send poisoned model updates that inject backdoors or degrade the shared model's performance. Because the server cannot inspect individual participants' data, detecting and mitigating poisoning in federated learning is especially challenging [18].
With the rise of large language models, new attack vectors have emerged. Poisoning attacks can target the fine-tuning stage, the retrieval-augmented generation (RAG) pipeline, or even the tools and plugins that LLM-based agents use. The MCP (Model Context Protocol) tool poisoning attack demonstrated that invisible instructions embedded in tool descriptions could silently redirect LLM agent behavior, achieving 72% success rates in some configurations [16].
Data poisoning intersects with several other AI security concerns.
Model collapse occurs when generative AI models are recursively trained on their own outputs, leading to progressive degradation and loss of distributional diversity. While model collapse is not an intentional attack, it shares the same underlying mechanism as data poisoning: corrupted training data leading to degraded model behavior. In some sense, model collapse can be viewed as an unintentional, large-scale form of data availability poisoning.
Prompt injection targets AI systems at inference time by embedding malicious instructions in the input data that the model processes. Data poisoning and prompt injection can be used in combination: a poisoned model may be more susceptible to prompt injection attacks, and prompt injection can be used to exfiltrate information about a model's training data that could inform a subsequent poisoning attack.
Data poisoning can be part of a broader AI supply chain attack, where an adversary compromises a component of the AI development pipeline (training data, pretrained model weights, fine-tuning datasets, or evaluation benchmarks) to influence the behavior of downstream models. The increasing use of pretrained models, shared datasets, and third-party data providers creates multiple points of vulnerability [19].
As of early 2026, data poisoning remains an active and growing area of concern in AI security. Several trends define the current landscape.
The scale of modern training datasets continues to increase, making manual verification of training samples infeasible and expanding the attack surface for poisoning. At the same time, the diversity of attack methods has grown substantially. A 2025 survey identified six major categories of poisoning algorithms (heuristic-based, label flipping, feature collision, bilevel optimization, influence-based, and generative model-based), each with multiple variants [5].
On the defense side, certified defenses have improved but remain limited in scale and practical applicability. Data sanitization techniques have been shown to be insufficient against adaptive adversaries, driving research toward more robust approaches such as ensemble methods, spectral analysis, and training-time monitoring. The PoisonBench benchmarking framework, introduced in 2025, provides standardized evaluations of poisoning attacks and defenses, helping the research community measure progress more reliably [16].
The defensive use of data poisoning, exemplified by Nightshade and Glaze, has gained significant traction among artists and content creators who view it as one of the few effective technical tools for protecting their work from unauthorized use in AI training. This has created a novel dynamic where data poisoning is simultaneously a threat to AI systems and a tool for protecting intellectual property.
Industry responses include increased investment in data provenance tracking, supply chain security for AI training pipelines, and red teaming programs that specifically test for poisoning vulnerabilities. Regulatory frameworks, including the EU AI Act, are beginning to address data integrity requirements for AI training, though specific technical standards for poisoning resistance are still under development.
The fundamental challenge of data poisoning is that machine learning models inherently trust their training data. Until this assumption can be relaxed through robust verification mechanisms, certified defenses, or fundamentally new training paradigms, data poisoning will remain a significant threat to the reliability and safety of AI systems.