An adversarial attack is a technique for crafting inputs that are deliberately designed to cause artificial intelligence systems, particularly machine learning models, to produce incorrect or undesired outputs. These inputs, known as adversarial examples, exploit vulnerabilities in the way models process data. In computer vision, an adversarial example might be an image with tiny, carefully computed perturbations that are invisible to a human observer but cause a deep learning classifier to misidentify the subject with high confidence. In natural language processing, an adversarial input might be a string of seemingly random tokens appended to a prompt that causes a large language model to ignore its safety training.
Adversarial attacks have grown into one of the most active research areas in AI safety and security, with implications spanning autonomous driving, medical diagnostics, content moderation, biometric authentication, and the alignment of frontier language models. The field began with a single paper, first circulated in late 2013 by researchers at Google and New York University, and has since produced thousands of follow-up works, dozens of attack methods, and a long, ongoing arms race between attackers and defenders. As of 2026, no general-purpose defense provides full protection against well-resourced adaptive attackers in either the vision or language domain.
The phenomenon of adversarial examples was first identified by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus in their paper "Intriguing properties of neural networks," first posted to arXiv in December 2013 and presented at the 2nd International Conference on Learning Representations (ICLR) in Banff in April 2014 [1]. The paper revealed two counter-intuitive properties of neural networks. First, the semantic information in learned representations is distributed across the entire space of activations, not concentrated in individual units. Second, and more consequentially, neural networks learn input-output mappings that are surprisingly discontinuous. A small, imperceptible perturbation to an input image, found by maximizing the network's prediction error using box-constrained L-BFGS optimization, could cause confident misclassification.
This discovery was startling because it suggested that even highly accurate neural networks had fundamental blind spots. The perturbations needed to fool these models were so small that a human could not distinguish the original image from the adversarial one, yet the model's predictions changed dramatically. The paper demonstrated this vulnerability across multiple network architectures and datasets (including AlexNet on ImageNet and a smaller MNIST classifier), suggesting it was not an artifact of a particular model but a property of how neural networks generalize. The original L-BFGS attack was slow and expensive, often requiring hundreds of optimization steps per image, but it established the basic recipe that the field has used ever since: pose adversarial example generation as an optimization problem in the input space.
In 2015, Goodfellow, Jonathon Shlens, and Szegedy followed up with "Explaining and Harnessing Adversarial Examples" at ICLR, which introduced the Fast Gradient Sign Method (FGSM) and offered a theoretical explanation rooted in the linear behavior of high-dimensional models [2]. They argued that the vulnerability was not due to nonlinearity or overfitting but rather to the linear nature of the models in high-dimensional spaces. In a layer that performs an inner product with the input, even a small per-pixel perturbation can sum to a large change in the activation when there are thousands of pixels. This insight made adversarial examples not just a curiosity but a well-characterized and reproducible failure mode, and it explained why the same adversarial example could fool many different networks: any sufficiently linear model trained on similar data would tend to be vulnerable to similar perturbations.
Adversarial attacks are usually classified along three axes: what the attacker knows about the model, what stage of the machine learning pipeline they target, and what they want to achieve. The intersection of these dimensions defines the threat model, which determines which attacks are feasible and which defenses are meaningful. Rigorous evaluations always specify a threat model; without one, robustness numbers are essentially meaningless because they could have been measured against an arbitrarily weak adversary.
| Threat model | Attacker knowledge | Typical setting | Common attacks |
|---|---|---|---|
| White-box | Full access to weights, architecture, gradients, and training data | Open-source models, internal red team | FGSM, PGD, C&W, AutoAttack, GCG |
| Gray-box | Knows architecture or training data but not weights, or has limited gradient access | API with logits returned, leaked checkpoint | Surrogate-model attacks, query-efficient gradient estimation |
| Black-box (query-based) | Only API access, observes labels or scores | Commercial classifier APIs | ZOO, NES, Square Attack, HopSkipJump |
| Black-box (transfer) | No access to target; trains a local surrogate | Closed proprietary models | FGSM/PGD/GCG suffixes crafted on a similar open model |
| Decision-based | Sees only the predicted label, no scores | Hardened APIs that hide logits | Boundary attack, HopSkipJump |
| Physical | Operates in the real world; perturbation must survive printing, lighting, motion | Autonomous vehicles, biometric gates | Adversarial patches, eyeglasses, projected light |
White-box attackers are the most powerful and the standard against which defenses are measured. Gray-box and query-based black-box attacks more closely match the access an external attacker would have against a deployed system. Transfer attacks are particularly important because they show that even keeping a model fully proprietary does not provide security if a structurally similar open model exists.
Adversarial attacks span the entire machine learning pipeline. The categories that follow are not mutually exclusive: a single real campaign might combine several. The taxonomy follows the standard organization used in surveys such as Biggio and Roli (2018) and the NIST AI 100-2 report on adversarial machine learning [3].
Evasion attacks, also called test-time attacks, are the most widely studied category. The attacker modifies an input at inference time so that a deployed model produces an incorrect prediction. The model itself remains unchanged; only the input is manipulated. All of the classic adversarial example methods (FGSM, PGD, C&W, DeepFool, GCG) fall into this category. Evasion is the natural threat model for spam filters, malware detectors, content classifiers, and biometric authentication, where an attacker controls the input that flows into a fixed model.
Poisoning attacks target the training phase. An attacker introduces malicious data into the training set, causing the model to learn incorrect patterns. Poisoning attacks come in several flavors. Targeted poisoning manipulates the model so that it misclassifies specific inputs at test time, while performing normally on most other inputs. Indiscriminate poisoning aims to degrade overall accuracy across the board, often as a denial-of-service attack on a competitor. Clean-label poisoning adds samples that look correctly labelled to a human reviewer but contain perturbations that bias the model toward the attacker's chosen behavior.
Poisoning is particularly worrying for models trained on large, web-scraped datasets where it is difficult to verify every training sample. Carlini and others have shown that controlling even a tiny fraction of the URLs referenced by a popular web-scraped dataset (under 0.01 percent of LAION-400M, for example) is enough to plant poisoned training data into many downstream models [4].
A backdoor attack is a special case of poisoning. The attacker embeds a trigger pattern (a small sticker, a particular phrase, an inaudible audio cue) during training so that any input containing the trigger produces a predetermined output, while clean inputs are handled correctly. The attack is hard to detect because the model behaves normally on the standard test set. The original BadNets paper by Gu, Dolan-Gavitt, and Garg in 2017 showed how a tiny patch of pixels in the corner of an image could be used to plant a hidden classification override. Backdoors are especially relevant for supply-chain risk: a model downloaded from a public hub or fine-tuned by a third party could ship with a backdoor that the deployer cannot easily inspect.
In a model extraction attack, the adversary aims to steal or replicate a proprietary model by systematically querying it and using the input-output pairs to train a substitute model. The attacker repeatedly probes the target system, gradually building a near-identical copy of its decision logic. This is a threat to companies that offer machine learning as a service, since the extracted model can be used without paying for API access or can serve as a stepping stone for more effective white-box evasion attacks. Tramer and colleagues demonstrated this against commercial APIs as early as 2016, and similar methods have since been adapted to extract sentiment classifiers, translation models, and even smaller language models.
A membership inference attack tries to determine whether a specific data point was part of the training set. Shokri et al. (2017) showed that this is possible by training a meta-classifier on the model's confidence scores: examples the model has memorized tend to receive sharper, more confident predictions than fresh inputs. This is a privacy concern in domains such as healthcare and finance, where merely confirming that someone's record was used to train a diagnostic or credit model can constitute a data leak.
Model inversion attacks reconstruct sensitive training data from the model's outputs. For example, given a facial recognition model, an attacker can infer what faces were in the training set by optimizing inputs that maximize the model's confidence for a given identity [5]. For language models, Carlini and colleagues showed in 2021 that GPT-2 would, with the right prompts, regurgitate verbatim training data including names, phone numbers, and chunks of copyrighted text. The follow-up "Scalable Extraction of Training Data" work in 2023 broke this open at scale on production-grade models. Model inversion sits at the intersection of adversarial attacks and privacy research and is the most direct demonstration that machine learning models can leak personal information.
Over the past decade, researchers have developed a rich toolbox of adversarial attack algorithms for vision. The table below summarizes the most influential methods. Most differ along three axes: how they measure perturbation size (the L-norm), whether they iterate, and whether they require white-box gradient access.
| Method | Authors (year) | Type | Key idea | Perturbation norm |
|---|---|---|---|---|
| L-BFGS attack | Szegedy et al. (2014) | White-box, optimization | Box-constrained L-BFGS to find smallest perturbation that flips the label; the original adversarial attack | L2 |
| FGSM | Goodfellow et al. (2015) | White-box, single-step | Computes the sign of the loss gradient and takes one step of size epsilon | L-infinity |
| BIM / I-FGSM | Kurakin et al. (2017) | White-box, iterative | Iterative version of FGSM with smaller step size and clipping | L-infinity |
| PGD | Madry et al. (2018) | White-box, iterative | Iterates FGSM with random initialization and projects back onto the epsilon-ball; treated as the standard strong first-order attack | L-infinity, L2 |
| C&W | Carlini and Wagner (2017) | White-box, optimization | Frames adversarial example generation as an optimization problem minimizing perturbation size subject to misclassification | L0, L2, L-infinity |
| DeepFool | Moosavi-Dezfooli et al. (2016) | White-box, iterative | Finds the minimal perturbation to cross the nearest decision boundary using a linear approximation | L2 |
| JSMA | Papernot et al. (2016) | White-box, saliency | Uses the Jacobian to identify the most influential pixels and modifies a few at a time | L0 |
| Universal adversarial perturbations | Moosavi-Dezfooli et al. (2017) | Image-agnostic | Computes a single perturbation vector that fools the model on most inputs | L-infinity, L2 |
| One-pixel attack | Su, Vargas and Sakurai (2019) | Black-box, evolutionary | Differential evolution to find a single pixel whose change flips the prediction | L0 = 1 |
| Adversarial Patch | Brown et al. (2017) | Physical, image-agnostic | Creates a conspicuous but localized patch that, when placed in a scene, causes targeted misclassification | Spatially localized |
| Expectation over Transformation | Athalye et al. (2018) | Physical | Adds a distribution over rotations, scales, and lighting during optimization to make perturbations robust to physical conditions | Variable |
| AutoAttack | Croce and Hein (2020) | White-box and black-box ensemble | Combines APGD-CE, APGD-DLR, FAB, and Square Attack into a parameter-free, reliable evaluation suite | L-infinity, L2 |
The Fast Gradient Sign Method (FGSM) computes the gradient of the loss function with respect to the input and perturbs the input by a fixed amount epsilon in the direction that maximizes the loss. The full update is x' = x + epsilon * sign(grad_x L(theta, x, y)). It is extremely fast (a single forward and backward pass) but produces relatively weak adversarial examples compared to iterative methods. FGSM is typically used as a sanity-check baseline rather than as a serious attack on a real defense.
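As an illustration, a minimal PyTorch sketch of this update might look like the following. It assumes `model`, `x`, `y`, and `epsilon` are supplied by the caller and that pixel values lie in [0, 1]; it is a sketch of the idea, not any particular reference implementation.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """Single-step FGSM: move each pixel by epsilon along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # One step of size epsilon in the direction that increases the loss,
    # then clip back to the valid pixel range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```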
Projected Gradient Descent (PGD), introduced by Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu in their 2018 ICLR paper "Towards Deep Learning Models Resistant to Adversarial Attacks" [6], applies FGSM iteratively with a smaller step size and projects the result back onto the allowed perturbation set (the epsilon-ball) after each step. It also uses random initialization within the perturbation budget, which helps escape suboptimal saddle points in the loss landscape. PGD is widely treated as the strongest first-order attack and forms the basis for adversarial training, the most effective empirical defense.
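A minimal sketch of the corresponding PGD loop, under the same assumptions as the FGSM sketch above; the step size and iteration count are illustrative defaults rather than values from any specific paper.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, epsilon, alpha=None, steps=10):
    """Iterated FGSM with random start and projection onto the L-infinity epsilon-ball."""
    alpha = alpha if alpha is not None else epsilon / 4
    # Random initialization inside the allowed perturbation set.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            # Ascent step on the loss, then projection back onto the
            # epsilon-ball around the clean input and the valid pixel range.
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```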
Madry et al. framed adversarial robustness as a saddle-point optimization problem: the defender minimizes the worst-case loss inside a small neighborhood around each training input, and the attacker maximizes that loss. Solving this min-max problem with PGD on the inner maximization gives empirically robust models. The framing has become the dominant way the field thinks about defense, and most subsequent defenses are variants of this saddle-point view.
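In symbols, the saddle-point problem can be written as follows, where D denotes the data distribution and S the set of allowed perturbations (for example, the L-infinity ball of radius epsilon):

```latex
\min_{\theta} \; \mathbb{E}_{(x,\,y) \sim \mathcal{D}}
  \Big[ \max_{\delta \in \mathcal{S}} \, L\big(\theta,\; x + \delta,\; y\big) \Big]
```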
In empirical benchmarks the gap between FGSM and PGD is often dramatic. In experiments on a CNN trained for brain tumor classification, FGSM attacks reduced accuracy from 96 percent to 32 percent, while PGD attacks reduced it to 13 percent [7]. The C&W attack was even more devastating in some configurations, confirming that iterative, optimization-based attacks are consistently stronger than single-step methods.
Nicholas Carlini and David Wagner proposed their attack in 2017 in response to claims that defensive distillation provided robust defense [8]. The Papernot et al. distillation paper had reported that distillation reduced attack success rates from 95 percent to less than 0.5 percent on MNIST and CIFAR. C&W showed that this number was an artifact of weaker attacks rather than genuine robustness. Their method frames adversarial example generation as a constrained optimization problem with three variants tailored to the L0, L2, and L-infinity distance metrics. Instead of simply maximizing the cross-entropy loss, the C&W attack minimizes the size of the perturbation while ensuring the model misclassifies the input, and it uses a custom loss function that maximizes the gap between the target class logit and the highest non-target logit. With these changes, defensive distillation provided no measurable robustness at all.
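A minimal sketch of a C&W-style margin term (targeted variant) is shown below; `logits` is the model's pre-softmax output and `kappa` the desired confidence margin. The full attack additionally optimizes in a change-of-variables space and binary-searches over the constant that trades the perturbation's L2 size against this term, which the sketch omits.

```python
import torch
import torch.nn.functional as F

def cw_margin(logits, target, kappa=0.0):
    """C&W-style margin: positive while the best non-target logit beats the target logit."""
    one_hot = F.one_hot(target, num_classes=logits.size(-1)).bool()
    target_logit = logits[one_hot]
    best_other = logits.masked_fill(one_hot, float("-inf")).max(dim=-1).values
    # Saturates at -kappa once the target class wins by margin kappa.
    return torch.clamp(best_other - target_logit, min=-kappa)
```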
The broader lesson, which has been re-learned many times since, is that defenses that work against weak attacks often look much less impressive when evaluated against attacks specifically designed to circumvent them. The C&W attack remains one of the strongest benchmarks for evaluating defenses and is a standard component of any serious robustness evaluation.
DeepFool, proposed by Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard in 2016 [9], takes a geometric perspective. It iteratively computes the minimal perturbation needed to push a data point across the nearest decision boundary by linearizing the classifier at each step and finding the smallest step to the nearest hyperplane (the binary-classifier version of this step is written out below). DeepFool is useful for measuring the robustness of classifiers because it directly estimates the distance to the decision boundary rather than just finding any successful perturbation. The Jacobian-based Saliency Map Attack (JSMA), introduced by Papernot et al. in 2016, takes a complementary view: it uses the Jacobian of the network to score each input pixel by how much it would push the output toward a chosen target class, then modifies the most influential pixels one or two at a time. JSMA produces sparse perturbations under the L0 norm, which can be more visually obvious but are more relevant in settings where an attacker can modify only a few input features.
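For a binary classifier f with decision boundary f(x) = 0, the DeepFool step is the minimal L2 perturbation that reaches the linearized boundary; the multi-class version applies the same formula to the nearest competing boundary:

```latex
r_{*}(x) \;=\; -\,\frac{f(x)}{\lVert \nabla f(x) \rVert_{2}^{2}} \,\nabla f(x)
```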
Most adversarial attacks compute a perturbation specific to a single input. Universal adversarial perturbations (UAPs), introduced by Moosavi-Dezfooli et al. in 2017 [10], are single perturbation vectors that cause misclassification when added to almost any input image. UAPs are computed by iteratively optimizing across a dataset of images, often using DeepFool as a subroutine. Their existence demonstrates that neural networks have systematic, input-independent vulnerabilities.
Adversarial patches take a different approach. Instead of applying an imperceptible perturbation to the entire image, a patch attack places a conspicuous but spatially localized pattern somewhere in the scene. Brown et al. (2017) showed that a printed patch could be placed in the physical world and cause targeted misclassification in neural network classifiers [11]. A famous early demonstration showed an adversarial patch causing an ImageNet classifier to confidently label nearly any image as a toaster. Patches have since been applied to attack object detectors, facial recognition systems, and autonomous vehicle perception pipelines.
In an extreme corner of the design space, Su, Vargas, and Sakurai (2019) showed that simply changing the value of a single pixel using differential evolution could fool a deep network on roughly 68 percent of CIFAR-10 test images and around 16 percent of ImageNet images, in some cases with high model confidence in the wrong label [12]. The one-pixel attack does not produce visually imperceptible perturbations (a single bright pixel is usually visible), but it dramatically illustrates the brittleness of standard classifiers and the irrelevance of perceptual similarity to model decisions.
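A hedged sketch of this search using SciPy's differential evolution is shown below; `predict_true_class_prob` is a hypothetical helper that applies a candidate pixel change and returns the model's confidence in the true label, and the population and iteration settings are illustrative rather than the paper's.

```python
import numpy as np
from scipy.optimize import differential_evolution

def one_pixel_attack(image, true_label, predict_true_class_prob, max_iter=100):
    """Search for a single (row, col, r, g, b) change that minimizes the true-class confidence."""
    height, width = image.shape[:2]
    bounds = [(0, height - 1), (0, width - 1), (0, 255), (0, 255), (0, 255)]

    def objective(candidate):
        row, col, r, g, b = candidate
        perturbed = image.copy()
        perturbed[int(row), int(col)] = np.array([r, g, b], dtype=image.dtype)
        # differential_evolution minimizes, so lower true-class confidence is better.
        return predict_true_class_prob(perturbed, true_label)

    result = differential_evolution(objective, bounds, maxiter=max_iter, popsize=10, tol=1e-5)
    return result.x  # best (row, col, r, g, b) found
```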
AutoAttack, introduced by Francesco Croce and Matthias Hein in 2020 [13], addresses a persistent problem in adversarial robustness research: unreliable evaluations. Many proposed defenses were later broken by stronger attacks, sometimes by the same authors who published the defense. AutoAttack provides a standardized, parameter-free evaluation by combining four complementary attacks: APGD-CE (an automated PGD with cross-entropy loss), APGD-DLR (an automated PGD with the difference of logits ratio loss), FAB (a minimum-norm attack), and Square Attack (a black-box query attack). When applied to fifty-plus published defenses, AutoAttack reduced the reported robust accuracy by more than 10 percentage points in 13 cases, revealing that many claimed defenses were weaker than originally reported. AutoAttack is now the standard evaluation tool used in the RobustBench leaderboard, which tracks the state of the art in adversarial robustness across 120-plus models on CIFAR-10, CIFAR-100, and ImageNet [14].
Computer vision has been the primary testbed for adversarial attack research, and it has produced some of the most striking demonstrations of real-world impact.
Some of the most cited demonstrations involve adversarial perturbations applied to traffic signs. Eykholt et al. (2018) introduced the Robust Physical Perturbations (RP2) algorithm and showed that black and white stickers placed on a real stop sign caused the LISA-CNN classifier to misread it as a 45 mph speed limit sign 100 percent of the time in lab settings and 84.8 percent of the time in field tests captured from a moving vehicle [15]. The follow-up DARTS work and many other studies extended these results to other architectures and to a wider range of physical conditions [16].
The implications for autonomous vehicles are severe. A system that misidentifies a stop sign could fail to brake, leading to collisions. A 2024 systematic review of adversarial attacks on autonomous driving highlighted that adversarial patches can disrupt traffic sign recognition, lane detection, and object detection, creating safety hazards including traffic sign rule violations, unexpected emergency braking, and speeding [17].
In 2016, Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael Reiter demonstrated that specially designed eyeglass frames could fool state-of-the-art facial recognition systems [18]. The glasses, printed on a standard inkjet printer, were able to either evade detection (causing the system to fail to recognize the wearer) or impersonate a different person entirely. The team achieved a 90 percent success rate against the commercial Face++ facial recognition API. They demonstrated impersonation of specific individuals, including celebrities, using glasses that appeared only slightly unusual to a human observer. This work underscored a critical asymmetry: adversarial examples exploit features that machine learning models rely on but that humans do not notice.
Beyond signs and faces, researchers have used adversarial patches to fool object detectors into ignoring people (the "invisibility cloak" demonstrations of Thys et al. 2019), to confuse self-driving simulators with adversarial road textures, and to disrupt drone-based surveillance. The shift from imperceptible perturbations to visible patches reflects a more honest threat model: in the real world, adversaries usually do not need to be invisible to a human; they only need to be invisible to whatever automated system is making the decision.
Classic demonstrations on ImageNet showed that an L-infinity perturbation of just 8/255 per pixel (a tiny shift on the standard 0-255 color scale) is enough to drive standard ResNet, VGG, and Vision Transformer models from over 75 percent top-1 accuracy down to under 5 percent. Even adversarially trained models on RobustBench top out around 60 percent clean accuracy and around 38 percent robust accuracy under AutoAttack at this perturbation budget, with the gap closing slowly as model capacity and data scale grow [14].
With the rise of large language models such as GPT-4, Claude, and Gemini, adversarial attacks have extended beyond computer vision into natural language processing. LLMs present unique attack surfaces because they process sequential text, follow natural language instructions, and are often deployed in agentic systems that take real-world actions. Many of the techniques that defined the vision adversarial literature have analogs in the language domain, but the discrete nature of text and the existence of safety alignment add new challenges and new attack vectors.
Before instruction-tuned chatbots dominated the conversation, NLP adversarial work focused on tricking text classifiers. TextFooler, introduced by Di Jin et al. at AAAI 2020, attacked BERT-based classifiers by ranking words by their influence on the prediction and replacing them with synonyms that preserve grammar and meaning [19]. The attack drove classification accuracy on standard benchmarks from over 90 percent to under 20 percent while changing only around 10 percent of the words. BERT-Attack, HotFlip, and DeepWordBug followed similar recipes. These attacks generally rely on word-level or character-level substitutions that respect surface fluency. They remain relevant for content moderation classifiers, fraud detectors, and sentiment models even in the LLM era.
Jailbreaks are a class of attacks on aligned LLMs that aim to elicit harmful, dangerous, or policy-violating content. While GCG (described below) is optimization-based, many jailbreaks are manually crafted through social engineering of the model, using techniques such as role-play personas, fictional or hypothetical framing, obfuscation or encoding of the request, and gradual multi-turn escalation toward the prohibited objective.
Microsoft's Crescendo attack, published in April 2024 by Mark Russinovich, Ahmed Salem, and Ronen Eldan, formalizes the multi-turn approach. It begins with a benign question on a related topic and gradually steers the conversation toward the prohibited objective, exploiting the model's tendency to maintain consistency with its own prior outputs. Crescendomation, the automated version, jailbreaks targets including GPT-4 and Gemini-Pro in fewer than five turns on average and achieves 29 to 71 percentage point gains over earlier jailbreak methods on the AdvBench subset [20]. Microsoft followed Crescendo with Skeleton Key in June 2024, a single-turn attack that asks the model to augment rather than replace its safety guidelines and to prefix harmful content with a warning instead of refusing; Skeleton Key affected models including Claude 3 Opus, GPT-4o, Gemini Pro, Llama 3 70B, Mistral Large, and Cohere Command R+ [21].
A 2024 survey catalogued hundreds of jailbreak techniques and observed an arms race between attackers and defenders [22]. Hughes et al. (2024) showed that Best-of-N jailbreaking, which simply samples thousands of slightly perturbed prompt variants and keeps the one that succeeds, achieves near-100 percent attack success rates against GPT-4, Claude, Gemini, Llama, and other targets within seconds [23]. Research by Carlini, Nasr, and others in 2025 demonstrated that adaptive attackers can bypass most proposed defenses, suggesting that the jailbreak problem is far from solved [24].
The Greedy Coordinate Gradient (GCG) attack, introduced by Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson in July 2023, was a landmark result [25]. GCG is a white-box optimization method that appends an adversarial suffix to a harmful prompt. The suffix is optimized by searching over single-token substitutions to maximize the likelihood that the model produces an affirmative response (typically beginning with "Sure, here is") rather than a refusal. The resulting suffixes are typically gibberish strings that are meaningless to humans but reliably cause the model to comply with harmful requests.
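The core of a single GCG step can be sketched as follows. The helpers `loss_from_suffix_embeds` (the target-completion loss computed from suffix embeddings, so gradients can flow) and `loss_from_suffix_ids` (the same loss from hard token ids) are hypothetical stand-ins for the prompt and target plumbing; the published attack also batches candidate evaluation, filters invalid tokens, and adds several optimization refinements.

```python
import torch

def gcg_step(suffix_ids, embed_matrix, loss_from_suffix_embeds, loss_from_suffix_ids,
             top_k=256, n_candidates=128):
    """One greedy coordinate-gradient step over an adversarial suffix (simplified)."""
    vocab_size = embed_matrix.size(0)
    # 1. Relax the suffix to one-hot vectors so the target loss can be
    #    differentiated with respect to each token choice.
    one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)
    loss_from_suffix_embeds(one_hot @ embed_matrix).backward()

    # 2. For each suffix position, the most promising substitutions are the
    #    tokens with the most negative gradient (largest first-order loss decrease).
    candidates = (-one_hot.grad).topk(top_k, dim=-1).indices  # (suffix_len, top_k)

    # 3. Try random single-token substitutions and keep the best one found.
    best_ids, best_loss = suffix_ids, loss_from_suffix_ids(suffix_ids)
    for _ in range(n_candidates):
        pos = torch.randint(suffix_ids.numel(), (1,)).item()
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        trial_loss = loss_from_suffix_ids(trial)
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    return best_ids
```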
GCG demonstrated that safety alignment in LLMs could be bypassed through automated optimization rather than human creativity. The attack was effective against multiple model families, and adversarial suffixes crafted on open-source Llama models transferred to proprietary systems like ChatGPT and Claude. The paper has been cited well over a thousand times and was covered in the New York Times. It is now treated as the standard white-box benchmark for LLM jailbreaks the way PGD is treated for vision.
Since the original paper, numerous improvements have been proposed. Faster GCG and skip-gradient GCG reduce computational cost. Momentum-GCG stabilizes the optimization process. AttnGCG leverages attention manipulation for more targeted attacks. AmpleGCG learns a generative model of adversarial suffixes that produces large numbers of attacks on demand, dramatically improving cross-model transferability [26]. Research continues into scaling GCG beyond mid-sized models (3B to 13B parameters) to very large models, though the discrete optimization remains computationally expensive at frontier scale.
A different style of white-box attack works on the model's internal representations rather than its inputs. Andy Arditi and colleagues showed in 2024 that refusal in current open-source chat models up to 72B parameters is mediated by a single direction in the residual stream activations [27]. Identifying this direction (using a small number of harmful and harmless prompts) and then ablating it from every layer and every token position produces a model that no longer refuses harmful instructions, with minimal impact on other capabilities. The method is sometimes called "abliteration" in open-source communities and has been independently reproduced many times. The result is significant because it suggests that current safety fine-tuning is shallow: a single linear direction encodes the entire refusal behavior, and removing it is straightforward when weights are accessible.
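A minimal sketch of the two core operations, assuming `harmful_acts` and `harmless_acts` are arrays of residual-stream activations (one row per prompt) collected at a chosen layer and token position; the published method selects the layer and position empirically and applies the ablation throughout the model during generation.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference of means between harmful and harmless activations, normalized to unit length."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate(activations, direction):
    """Remove the component of each activation vector along the refusal direction."""
    return activations - np.outer(activations @ direction, direction)
```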
Anthropic's many-shot jailbreaking research, published in April 2024, exploits long context windows to override safety alignment with sheer volume of demonstrations [28]. The attack constructs a single prompt containing hundreds of fictitious turns of dialogue in which the assistant complies with harmful requests, then ends with the real harmful question. The model treats the fake history as evidence about how it should respond and produces the harmful output. Effectiveness scales as a power law with the number of shots, and the technique works on Claude 2.0, GPT-3.5 and GPT-4, Llama 2 70B, and Mistral 7B. Anthropic's mitigation, which combines fine-tuning the model to recognize many-shot patterns with a classification step that screens prompts before they reach the model, dropped the attack success rate from 61 percent to 2 percent in their best-case test.
Prompt injection is a distinct class of attack that targets LLM-powered applications rather than the model itself [29]. In a prompt injection attack, an adversary embeds malicious instructions in data that the LLM processes (a web page, an email, a calendar invite, a PDF). When the LLM reads and acts on this data, it follows the injected instructions instead of the application developer's original instructions.
Kai Greshake and colleagues introduced the term "indirect prompt injection" in their February 2023 paper, demonstrating attacks against Bing Chat (then powered by GPT-4), GPT-4 code completion, and a range of synthetic agentic systems [30]. The Open Worldwide Application Security Project (OWASP) ranked prompt injection as LLM01:2025, the top security vulnerability for large language model applications [31]. Prompt injection is particularly dangerous in agentic systems where LLMs have access to tools, APIs, and databases, because an injected instruction could cause the model to exfiltrate data, send unauthorized messages, transfer funds, or take other harmful actions.
Google DeepMind published a detailed analysis of their experience defending Gemini against indirect prompt injections in 2025, highlighting the difficulty of distinguishing between legitimate instructions and adversarial ones when the model must process untrusted content [32].
Adversarial attacks now extend well beyond images and text. In speech recognition, Carlini and Wagner showed in 2018 that targeted audio adversarial examples can make a system transcribe any chosen sentence while sounding nearly identical to the original audio. In 3D point cloud classification (used in LiDAR-based perception), adversaries can perturb a few points in a scan to flip the predicted object class. In reinforcement learning, adversarial perturbations to observations can cause an agent to take catastrophically wrong actions, as Huang et al. (2017) demonstrated on Atari agents. Multimodal models that combine vision and language, such as GPT-4V, Gemini, and Claude vision, are vulnerable to a hybrid threat: an attacker can hide a textual jailbreak inside an image, where the model's vision encoder will read it as if it were a normal instruction. This makes images themselves a new prompt-injection vector for any LLM application that accepts user-uploaded files.
The research community has proposed numerous defenses against adversarial attacks, though none provides complete protection. Defenses generally fall into the categories below.
| Defense category | Representative method | Setting | Strengths | Limitations |
|---|---|---|---|---|
| Adversarial training | PGD adversarial training (Madry 2018) | Vision and text classifiers | Strongest empirical robustness; standard benchmark | 3-10x training cost; clean accuracy drop |
| Certified robustness | Randomized smoothing (Cohen 2019) | Vision classifiers | Provable guarantees inside an L2 ball | Small certified radii; high inference cost from sampling |
| Input preprocessing | JPEG compression, bit-depth reduction, feature squeezing | Vision systems | Cheap, plug-in defense | Often bypassed by adaptive attackers |
| Detection | Mahalanobis-distance detectors, ensemble disagreement | Vision and LLM | Simpler than full robustness | Detector itself can be attacked |
| Defensive distillation | Papernot et al. (2016) | Historical | Smooths gradients | Broken by C&W in 2017 |
| Guardrail classifiers | Llama Guard, NeMo Guardrails | LLM applications | Lightweight, deployable today | Can be bypassed or over-refuse |
| Constitutional Classifiers | Anthropic (2025) | LLM safety | Cuts universal-jailbreak success from 86% to 4.4% in red teaming | Inference overhead; slight over-refusal increase |
Adversarial training is the most well-established empirical defense. The idea is straightforward: during training, generate adversarial examples on the fly and include them in the training set so the model learns to classify them correctly. Madry et al. (2018) showed that training with PGD adversarial examples produces models that are robust to a wide range of first-order attacks [6]. This approach has been refined over the years with techniques such as TRADES (which explicitly balances clean and robust accuracy), AWP (adversarial weight perturbation), MART, and various data augmentation strategies that combine adversarial training with synthetic data from diffusion models.
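A minimal sketch of one epoch of PGD adversarial training, reusing a PGD attack like the one sketched earlier; `model`, `optimizer`, `train_loader`, and the attack budget `epsilon` are assumed to be defined, and details such as batch-norm handling and learning-rate schedules are omitted.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, optimizer, train_loader, epsilon):
    """One epoch of PGD adversarial training (outer step of the saddle-point problem)."""
    model.train()
    for x, y in train_loader:
        # Inner maximization: craft adversarial examples against the current weights.
        x_adv = pgd(model, x, y, epsilon, steps=7)
        # Outer minimization: update the weights on the adversarial batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```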
Adversarial training remains the gold standard, but it has significant costs. Generating attacks at every training step can increase training time by a factor of 3 to 10. It also tends to reduce accuracy on clean (non-adversarial) inputs by several percentage points, and the gap grows for stronger attacks. Schmidt et al. (2018) proved that achieving robust generalization requires fundamentally more data than standard generalization, regardless of the algorithm or model family [33]. This sample-complexity gap is one reason scaling adversarial training to ImageNet has been so much harder than scaling clean training.
Certified defenses provide mathematical guarantees that a model's prediction will not change for any perturbation within a specified norm ball. Randomized smoothing, introduced by Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter at ICML 2019, is the most scalable certified defense [34]. It constructs a smooth classifier by averaging predictions over Gaussian noise added to the input. Cohen et al. proved a tight connection between L2 robustness and the noise level, and demonstrated the first certified defense feasible at ImageNet scale: 49 percent certified top-1 accuracy under L2 perturbations of size 0.5 (about 127/255 in pixel scale). Other certified methods include interval bound propagation, abstract interpretation, and Lipschitz-constrained networks. While certified defenses provide strong theoretical guarantees, their certified radii tend to be small, certified robust accuracy lags behind empirical defenses, and inference is more expensive because predictions are averaged over many noise samples.
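The prediction step of randomized smoothing can be sketched as a majority vote over noisy copies of the input; the full procedure of Cohen et al. adds a statistical test on the vote counts and converts the top-class probability into a certified L2 radius. `base_classifier`, `sigma`, and `num_classes` are assumed inputs, and the sample counts are illustrative.

```python
import torch

def smoothed_predict(base_classifier, x, sigma, num_classes, n_samples=1000, batch_size=100):
    """Majority-vote prediction of the smoothed classifier g(x) = argmax_c P(f(x + noise) = c)."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n_samples // batch_size):
            batch = x.unsqueeze(0).repeat(batch_size, 1, 1, 1)  # x: a single (C, H, W) image
            noisy = batch + sigma * torch.randn_like(batch)
            preds = base_classifier(noisy).argmax(dim=1)
            counts += torch.bincount(preds.cpu(), minlength=num_classes)
    return counts.argmax().item()
```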
Preprocessing defenses transform inputs before they reach the model, aiming to destroy adversarial perturbations while preserving legitimate features. Techniques include JPEG compression, bit-depth reduction, spatial smoothing, and feature squeezing. A multi-layered defense combining adversarial training with feature squeezing improved resilience on brain tumor MRI classification from 32 percent (under FGSM attack, no defense) to 54 percent [7]. However, preprocessing defenses alone are generally insufficient against adaptive attackers who can account for the preprocessing step in their attack optimization. Backward Pass Differentiable Approximation (BPDA), introduced by Athalye, Carlini, and Wagner in their 2018 "Obfuscated Gradients Give a False Sense of Security" paper, broke a long list of preprocessing-based defenses by approximating the non-differentiable preprocessing step during the backward pass, showing that the apparent robustness came from gradient masking rather than genuine resistance.
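Two of the preprocessing transforms mentioned above are simple enough to sketch directly; both assume images as float arrays in [0, 1], and the JPEG helper uses Pillow with an illustrative quality setting.

```python
import io
import numpy as np
from PIL import Image

def reduce_bit_depth(x, bits=4):
    """Quantize a [0, 1] float image to 2**bits levels per channel (feature squeezing)."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def jpeg_roundtrip(x, quality=75):
    """Encode and decode a [0, 1] float image as JPEG to wash out high-frequency perturbations."""
    img = Image.fromarray((x * 255).astype(np.uint8))
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.asarray(Image.open(buffer)).astype(np.float32) / 255.0
```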
Detection-based defenses attempt to identify adversarial inputs before they reach the model, rather than trying to classify them correctly. Methods include training a separate detector network, monitoring statistical properties of inputs or hidden layer activations, and using ensemble disagreement (where adversarial examples are more likely to cause different models to disagree). For LLMs, defenses like SmoothLLM apply random perturbations to the prompt and check whether the model's output remains consistent, while SafeDecoding modifies the token sampling process to favor safe completions [35].
Anthropic's Constitutional Classifiers, published in January 2025, represent the current state of the art for defending LLMs against universal jailbreaks [36]. The system trains lightweight input and output classifiers on synthetic data generated by prompting Claude with a natural language "constitution" that specifies permitted and restricted content. In an external red-teaming exercise totaling more than 3,000 hours, no participant found a universal jailbreak that extracted the level of detail an unguarded model would provide. In automated evaluation, jailbreak success on protected models dropped from 86 percent to 4.4 percent (a 95 percent reduction), with only a 0.38 percentage point increase in production-traffic refusals and a 23.7 percent inference overhead. A follow-up "next-generation" Constitutional Classifiers system in 2025 replaced separate input and output classifiers with a single "exchange" classifier that monitors outputs in the context of inputs, cutting successful jailbreaks by more than half compared to the original system. Constitutional Classifiers are now the canonical demonstration that universal-jailbreak defense is at least tractable in practice, even though no LLM defense yet provides certified safety against arbitrary attackers.
Production LLM systems typically combine multiple defenses: safety fine-tuning of the underlying model, content classifiers (such as Llama Guard, OpenAI's Moderation API, or Constitutional Classifiers), system-prompt hardening, retrieval and tool-use sandboxing, output monitoring, and human-in-the-loop review for high-risk actions. Microsoft's Prompt Shields, Google DeepMind's CaMeL framework for agent isolation, and various open-source guardrail libraries fall into this stack. None of these layers is bulletproof; the goal is defense-in-depth rather than a single guarantee.
The original Goodfellow et al. linear hypothesis remains the most widely cited explanation: in high-dimensional spaces, even tiny per-coordinate perturbations can sum to a large change in any linear combination of coordinates, and modern neural networks behave approximately linearly inside small neighborhoods around the data they were trained on [2]. Subsequent work has refined this picture from several angles.
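A one-line calculation makes the argument concrete: for a linear unit with weight vector w over an n-dimensional input, the FGSM-style perturbation eta = epsilon * sign(w) shifts the activation by an amount that grows with the dimension, even though no single coordinate changes by more than epsilon (here m denotes the average magnitude of the weights):

```latex
w^{\top}(x + \eta) - w^{\top}x
  \;=\; \epsilon \, w^{\top}\operatorname{sign}(w)
  \;=\; \epsilon \,\lVert w \rVert_{1}
  \;\approx\; \epsilon \, m \, n
```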
Ilyas, Santurkar, Tsipras, Engstrom, Tran, and Madry (2019) argued in "Adversarial Examples Are Not Bugs, They Are Features" that adversarial vulnerability comes from non-robust features in the data: features that are statistically predictive of the label but not aligned with human perception [37]. They demonstrated this by constructing two datasets, one containing only "robust" features and one containing only "non-robust" features, and showed that standard training on the non-robust dataset, which appears mislabeled to a human, still produces a classifier that generalizes well to the original test set. The implication is that adversarial vulnerability is at least partly a property of the data distribution, not just the model.
Tsipras et al. (2019) showed that there can be an inherent tension between accuracy on clean inputs and robustness to adversarial perturbations [38]. Models that are robust to adversarial perturbations tend to learn features that align with human perception (such as shapes and textures), whereas standard models exploit high-frequency patterns that are predictive but imperceptible to humans. Robust models therefore make more semantically meaningful errors but are less accurate overall.
Schmidt et al. (2018) proved that the sample complexity of robust learning can be polynomially larger than that of standard learning even in simple Gaussian settings [33]. This information-theoretic gap helps explain why adversarial training is so much more data-hungry than ordinary supervised training, and why scaling robust ImageNet models has lagged behind scaling clean ones.
Finally, Madry et al. (2018) reframed the problem as a saddle-point optimization: the defender performs an outer minimization of the worst-case loss inside a small neighborhood around each input, so the existence of adversarial examples is just the statement that the inner maximization is hard to drive down, not that the model is broken in some deeper sense [6]. From this view, adversarial robustness is a property of the loss landscape that we can in principle improve with the right training procedure.
The practical impact of these results is stark. On the CIFAR-10 benchmark, standard models achieve over 95 percent accuracy. The best adversarially robust models on RobustBench, evaluated with AutoAttack under L-infinity perturbations (epsilon = 8/255), achieve around 70 percent robust accuracy while sacrificing several percentage points of clean accuracy [13][14]. The gap narrows but does not disappear as model capacity and training data increase. Recent work has explored methods to mitigate this tradeoff, including using additional unlabeled data, larger model architectures, vision transformers, and synthetic data generated from diffusion models. One approach achieved 38.72 percent L-infinity AutoAttack accuracy on ImageNet while improving clean accuracy by 10 percentage points relative to other robust models, suggesting that the tradeoff can be managed but not eliminated.
The robustness-accuracy tradeoff also affects secondary properties of models. Robust models tend to be significantly underconfident in their predictions, which affects calibration, out-of-distribution detection, and other downstream tasks. They also tend to produce more interpretable saliency maps and more semantically meaningful gradients, a side effect that has been used in generative-model research and concept editing.
One of the most concerning properties of adversarial examples is their transferability: an adversarial example crafted to fool one model often fools other models as well, even if those models have different architectures and were trained on different data [2]. This property enables black-box attacks, where an attacker trains a local surrogate model, generates adversarial examples against it, and uses those examples to attack a target model they have no direct access to. Transferability was demonstrated in the original Szegedy et al. paper and has been extensively studied since. Factors that affect it include the similarity of model architectures, training procedures, and data distributions. Ensemble-based attack methods that optimize against multiple models simultaneously tend to produce more transferable adversarial examples.
The GCG attack on LLMs exploited transferability to attack closed-source commercial models: adversarial suffixes optimized on open-source Llama models transferred successfully to GPT-4 and Claude [25]. The same property holds for many of the more recent jailbreak families. The implication is that any defense strategy must account for the possibility that adversarial inputs will be generated using entirely different models. It also means that releasing strong open-source models can effectively give attackers a free white-box surrogate for closed competitors, which has become an active discussion in AI safety policy.
Adversarial attacks pose genuine risks in safety-critical and security-critical applications. The list below is a sampling rather than an exhaustive catalog.
Autonomous driving systems rely on computer vision for perception tasks including object detection, traffic sign recognition, lane detection, and pedestrian identification. Adversarial attacks can target any of these components. Physical adversarial patches on road signs, adversarial patterns on vehicles or clothing, and even projected light patterns have been shown to fool perception systems [17]. Dynamic adversarial attacks can manipulate trajectory prediction modules, causing autonomous vehicles to make dangerous planning decisions. Defenses for autonomous driving are an active research area, with recent work focusing on LiDAR-specific attacks, sensor fusion, and cross-task attack frameworks.
In medical imaging, adversarial attacks could cause diagnostic AI systems to miss tumors, misclassify lesions, or produce false positives. This is particularly dangerous because clinicians are increasingly relying on AI-assisted diagnosis. Research has shown that adversarial perturbations can reduce the accuracy of brain tumor classifiers from 96 percent to as low as 13 percent under PGD attack [7]. Finlayson et al. (2019) argued in Science that adversarial vulnerability in medical AI also creates fraud incentives: if insurance reimbursement depends on an automated diagnosis, providers and patients may have incentives to perturb images to obtain favorable classifications.
Facial recognition systems used for access control and surveillance are vulnerable to physical adversarial attacks. The adversarial glasses work demonstrated that an attacker could impersonate another person or evade detection using nothing more than specially printed eyewear [18]. Spam filters, malware detectors, and intrusion detection systems that rely on machine learning are also vulnerable to adversarial evasion, where malicious content is modified just enough to bypass automated detection. Carlini and others have shown that adversarial example techniques transfer cleanly to PE-format malware classifiers and Android app classifiers.
As machine learning becomes more deeply integrated into cybersecurity infrastructure (intrusion detection systems, anomaly detectors, and threat classifiers), adversarial attacks become a tool for circumventing automated defenses. Research on the IoT-23 dataset found that convolutional neural networks used for network traffic classification are especially vulnerable to FGSM and PGD attacks, while simpler models like decision trees showed more robustness [39].
Agentic systems built on top of LLMs (browsing assistants, code-writing agents, customer-service bots with access to real APIs) inherit every adversarial weakness of the underlying model and add new ones. A successful prompt injection in a single web page or email can cause an agent to leak data, transfer funds, or send messages on behalf of its user. The Greshake et al. paper and the OWASP LLM Top 10 both treat agentic deployment as the highest-risk class of LLM application for this reason [30][31]. Practical mitigations include limiting the actions an agent can take without user confirmation, sandboxing tool access, and applying input and output classifiers around any text the agent reads from untrusted sources.
Robust evaluation has been an open problem in the field for as long as defenses have been proposed. The community has converged on a few standard benchmarks.
| Benchmark | Domain | Standard attack | Notes |
|---|---|---|---|
| RobustBench | Vision (CIFAR-10/100, ImageNet) | AutoAttack, ε∞ = 8/255 (CIFAR), ε∞ = 4/255 (ImageNet) | Tracks hundreds of submitted models; widely cited |
| AdvBench | LLM jailbreak | GCG, manual prompts | Original benchmark from the GCG paper |
| HarmBench | LLM jailbreak and harms | Multiple attacks; broad topic coverage | Used for automated red-teaming |
| JailbreakBench | LLM jailbreak | 100 behaviors, evolving leaderboard | NeurIPS 2024 datasets and benchmarks track |
| MMLU-Adversarial / SafetyBench | LLM safety | Curated harmful or ambiguous prompts | Often used alongside helpfulness metrics |
The core lesson from a decade of robustness evaluation is that defenses must be tested with attacks specifically designed to bypass them. Static benchmarks alone are insufficient because attack research moves faster than benchmark curation, and any defense that has been around for more than a few months will be the target of adaptive attacks in the literature.
As of early 2026, adversarial robustness remains an unsolved problem. Several trends define the current landscape.
For vision models, RobustBench continues to serve as the primary benchmark. The best models on the CIFAR-10 leaderboard under L-infinity threat (epsilon = 8/255) achieve robust accuracies around 70 percent, up from roughly 60 percent a few years ago, but still far from the 95 percent plus clean accuracy of standard models. Progress has been driven by larger architectures (especially Vision Transformers), more training data (including synthetic data from diffusion models), and refined adversarial training recipes.
For large language models, the arms race between jailbreaks and defenses has intensified. The GCG attack and its variants remain the standard white-box benchmark. Multiple defenses have been proposed, but a 2025 study by Carlini, Nasr, and others demonstrated that stronger adaptive attackers can bypass most of them, following a pattern similar to what happened in the vision domain years earlier [24]. Anthropic's Constitutional Classifiers result is the most encouraging single data point, showing that a layered classifier can drive universal-jailbreak success rates from 86 percent down to 4.4 percent without unacceptable refusal rates [36]. OWASP's designation of prompt injection as the top LLM vulnerability in its 2025 Top 10 reflects the seriousness with which the security community views these threats [31].
Certified robustness has seen incremental progress but remains limited in scale. Randomized smoothing can certify robustness on ImageNet-scale problems, but certified radii remain small relative to the perturbation budgets used in empirical evaluations.
Research directions actively being explored include: adversarial robustness of multimodal models that combine vision and language and present new attack surfaces, robustness of models in agentic settings where LLMs take actions in the world, scalable certified defenses, and the relationship between adversarial robustness and other desirable properties such as fairness, privacy, and interpretability. Red teaming has become a regulatory requirement for frontier-model deployment in several jurisdictions, and adversarial-attack literacy is now a standard expectation for anyone deploying ML in safety-critical settings.
The fundamental lesson of adversarial attack research is that machine learning models do not perceive the world the way humans do. They rely on statistical patterns that can be manipulated in ways that are invisible or incomprehensible to us. Until this gap is closed, or until reliable defenses are developed, adversarial vulnerability will remain a core challenge for the deployment of AI in high-stakes applications.