An adversarial attack is a technique for crafting inputs that are deliberately designed to cause artificial intelligence systems, particularly machine learning models, to produce incorrect outputs. These inputs, known as adversarial examples, exploit vulnerabilities in the way models process data. In computer vision, for instance, an adversarial example might be an image with tiny, carefully computed perturbations that are invisible to a human observer but cause a deep learning classifier to misidentify the subject with high confidence. Adversarial attacks have become one of the most active research areas in AI safety and security, with implications ranging from autonomous driving to medical diagnostics to the robustness of large language models.
The phenomenon of adversarial examples was first identified by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus in their 2013 paper "Intriguing properties of neural networks" [1]. The paper revealed two counter-intuitive properties of neural networks. First, the semantic information in learned representations is distributed across the entire space of activations, not concentrated in individual units. Second, and more consequentially, neural networks learn input-output mappings that are surprisingly discontinuous. A small, imperceptible perturbation to an input image, found by maximizing the network's prediction error, could cause confident misclassification.
This discovery was startling because it suggested that even highly accurate neural networks had fundamental blind spots. The perturbations needed to fool these models were so small that a human could not distinguish the original image from the adversarial one, yet the model's predictions changed dramatically. The paper demonstrated this vulnerability across multiple network architectures and datasets, suggesting it was not an artifact of a particular model but a property of how neural networks generalize.
In 2014, Goodfellow, Jonathon Shlens, and Szegedy followed up with "Explaining and Harnessing Adversarial Examples," which introduced the Fast Gradient Sign Method (FGSM) and provided a theoretical explanation rooted in the linear behavior of high-dimensional models [2]. They argued that the vulnerability was not due to nonlinearity or overfitting but rather to the linear nature of the models in high-dimensional spaces. This insight made adversarial examples not just a curiosity but a well-characterized and reproducible failure mode.
Adversarial attacks on machine learning systems can be organized along several axes: the stage of the ML pipeline being targeted, the attacker's knowledge of the model, and the attacker's goal.
Evasion attacks, also called test-time attacks, are the most widely studied category. The attacker modifies an input at inference time so that a deployed model produces an incorrect prediction. The model itself remains unchanged; only the input is manipulated. All of the classic adversarial example methods (FGSM, PGD, C&W, DeepFool) fall into this category. Evasion attacks are further divided by the attacker's access level:
Poisoning attacks target the training phase. An attacker introduces malicious data into the training set, causing the model to learn incorrect patterns. Poisoning attacks come in several flavors [3]:
Poisoning is particularly concerning for models trained on large, web-scraped datasets where it is difficult to verify every training sample.
In a model extraction attack, the adversary aims to steal or replicate a proprietary model by systematically querying it and using the input-output pairs to train a substitute model. The attacker repeatedly probes the target system, gradually building a near-identical copy of its decision logic [4]. This is a threat to companies that offer machine learning as a service, since the extracted model can be used without paying for API access or can serve as a stepping stone for more effective white-box evasion attacks.
Model inversion attacks attempt to reconstruct sensitive information about the training data by analyzing the model's predictions. For example, given a facial recognition model, an attacker could infer what faces were in the training set by optimizing inputs that maximize the model's confidence for a given identity [5]. This poses serious privacy risks in domains such as biometrics, healthcare, and finance.
Over the past decade, researchers have developed a rich toolbox of adversarial attack algorithms. The table below summarizes the most influential methods.
| Method | Authors (Year) | Type | Key Idea | Perturbation Norm |
|---|---|---|---|---|
| FGSM | Goodfellow et al. (2014) | White-box, single-step | Computes the sign of the loss gradient and takes one step of size epsilon | L-infinity |
| PGD | Madry et al. (2017) | White-box, iterative | Iterates FGSM with small steps and projects back onto the epsilon-ball; considered the strongest first-order attack | L-infinity, L2 |
| C&W | Carlini & Wagner (2017) | White-box, optimization | Formulates adversarial example generation as an optimization problem minimizing perturbation size subject to misclassification | L0, L2, L-infinity |
| DeepFool | Moosavi-Dezfooli et al. (2016) | White-box, iterative | Finds the minimal perturbation to cross the nearest decision boundary | L2 |
| AutoAttack | Croce & Hein (2020) | White-box + black-box ensemble | Combines APGD-CE, APGD-DLR, FAB, and Square Attack into a parameter-free, reliable evaluation suite | L-infinity, L2 |
| Universal Adversarial Perturbations (UAP) | Moosavi-Dezfooli et al. (2017) | Image-agnostic | Computes a single perturbation vector that fools the model on most inputs, not just one specific image | L-infinity, L2 |
| Adversarial Patch | Brown et al. (2017) | Physical, image-agnostic | Creates a conspicuous but localized patch that, when placed in a scene, causes targeted misclassification | Spatially localized |
The Fast Gradient Sign Method (FGSM) computes the gradient of the loss function with respect to the input and perturbs the input by a fixed amount epsilon in the direction that maximizes the loss. It is extremely fast (a single forward and backward pass) but produces relatively weak adversarial examples compared to iterative methods.
Projected Gradient Descent (PGD), introduced by Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu in 2017 [6], applies FGSM iteratively with a smaller step size and projects the result back onto the allowed perturbation set (the epsilon-ball) after each step. It also uses random initialization within the perturbation budget. PGD is widely considered the strongest first-order attack and forms the basis for adversarial training, the most effective empirical defense.
Recent studies have quantified the gap between FGSM and PGD. In experiments on a CNN trained for brain tumor classification, FGSM attacks reduced accuracy from 96% to 32%, while PGD attacks reduced it to 13% [7]. The C&W attack was even more devastating in some configurations, confirming that iterative, optimization-based attacks are consistently stronger than single-step methods.
Nicholas Carlini and David Wagner proposed their attack in 2017 as a response to claims of robust defenses based on defensive distillation [8]. Their method frames adversarial example generation as a constrained optimization problem. Instead of simply maximizing the loss, the C&W attack minimizes the size of the perturbation while ensuring the model misclassifies the input. The attack uses a custom loss function that maximizes the gap between the target class logit and the highest non-target logit, yielding highly effective adversarial examples with minimal distortion. The C&W attack remains one of the strongest benchmarks for evaluating defenses.
DeepFool, proposed by Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard in 2016 [9], takes a geometric perspective. It iteratively computes the minimal perturbation needed to push a data point across the nearest decision boundary. The method linearizes the classifier at each step and finds the smallest step to the nearest hyperplane. DeepFool is useful for measuring the robustness of classifiers because it directly estimates the distance to the decision boundary.
AutoAttack, introduced by Francesco Croce and Matthias Hein in 2020 [10], addresses a persistent problem in adversarial robustness research: unreliable evaluations. Many proposed defenses were later broken by stronger attacks. AutoAttack provides a standardized, parameter-free evaluation by combining four complementary attacks (APGD-CE, APGD-DLR, FAB, and Square Attack). When applied to 50+ published defenses, AutoAttack reduced the reported robust accuracy by more than 10% in 13 cases, revealing that many claimed defenses were weaker than originally reported. AutoAttack is now the standard evaluation tool used in the RobustBench leaderboard, which tracks the state of the art in adversarial robustness across 120+ models.
Most adversarial attacks compute a perturbation specific to a single input. Universal adversarial perturbations (UAPs), introduced by Moosavi-Dezfooli et al. in 2017 [11], are single perturbation vectors that cause misclassification when added to almost any input image. UAPs are computed by iteratively optimizing across a dataset of images, often using DeepFool as a subroutine. Their existence demonstrates that neural networks have systematic, input-independent vulnerabilities.
Adversarial patches take a different approach. Instead of applying an imperceptible perturbation to the entire image, a patch attack places a conspicuous but spatially localized pattern somewhere in the scene. Brown et al. (2017) showed that a printed patch could be placed in the physical world and cause targeted misclassification in neural network classifiers [12]. Patches have since been applied to attack object detectors, facial recognition systems, and autonomous vehicle perception pipelines.
Computer vision has been the primary testbed for adversarial attack research. The field has produced some of the most striking demonstrations of the real-world implications of adversarial vulnerability.
Some of the most cited demonstrations involve adversarial perturbations applied to traffic signs. Researchers have shown that small stickers or carefully computed color patches applied to a stop sign can cause an autonomous vehicle's perception system to misclassify it as a speed limit sign or fail to detect it entirely [13]. These attacks have been validated in physical-world experiments, where printed adversarial perturbations remain effective under varying lighting conditions, distances, and viewing angles.
The implications for autonomous vehicles are severe. A system that misidentifies a stop sign could fail to brake, leading to collisions. A 2024 systematic review of adversarial attacks on autonomous driving highlighted that adversarial patches can disrupt traffic sign recognition, lane detection, and object detection, creating safety hazards including traffic sign rule violations, unexpected emergency braking, and speeding [14].
In 2016, Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael Reiter demonstrated that specially designed eyeglass frames could fool state-of-the-art facial recognition systems [15]. The glasses, printed on a standard inkjet printer, were able to either evade detection (causing the system to fail to recognize the wearer) or impersonate a different person entirely. The team achieved a 90% success rate against the commercial Face++ facial recognition API. They demonstrated impersonation of specific individuals, including celebrities, using glasses that appeared only slightly unusual to a human observer. This work underscored a critical asymmetry: adversarial examples exploit features that machine learning models rely on but that humans do not notice.
With the rise of large language models (LLMs) such as GPT-4, Claude, and Gemini, adversarial attacks have extended beyond computer vision into natural language processing. LLMs present unique attack surfaces because they process sequential text, follow natural language instructions, and are often deployed in agentic systems that take real-world actions.
The Greedy Coordinate Gradient (GCG) attack, introduced by Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson in 2023 [16], was a landmark result. GCG is a white-box optimization method that appends an adversarial suffix to a harmful prompt. The suffix is optimized by searching over token substitutions to minimize the likelihood that the model produces a refusal. The resulting suffixes are typically gibberish strings that are meaningless to humans but reliably cause the model to comply with harmful requests.
GCG demonstrated that safety alignment in LLMs could be bypassed through optimization. The attack was effective against multiple model families, and adversarial suffixes crafted on open-source models transferred to proprietary systems like ChatGPT and Claude, raising serious concerns about the robustness of alignment techniques.
Since the original paper, numerous improvements have been proposed. Faster GCG and skip-gradient GCG reduce computational cost. Momentum-GCG stabilizes the optimization process. AttnGCG leverages attention manipulation for more targeted attacks. AmpleGCG and Universal Jailbreak Suffixes improve cross-model transferability [17]. Research continues into scaling GCG beyond mid-sized models (3B to 13B parameters) to very large models, though this remains computationally challenging.
Jailbreaks are a broader category of attacks on LLMs that aim to elicit harmful, dangerous, or policy-violating content. While GCG is optimization-based, many jailbreaks are manually crafted through social engineering of the model. Common techniques include:
A 2024 survey catalogued hundreds of jailbreak techniques and observed an arms race between attackers and defenders [18]. Research by Hughes et al. in 2025 demonstrated that adaptive attackers can bypass most proposed defenses, suggesting that the jailbreak problem is far from solved.
Prompt injection is a distinct class of attack that targets LLM-powered applications rather than the model itself [19]. In a prompt injection attack, an adversary embeds malicious instructions in data that the LLM processes (such as a web page, email, or document). When the LLM reads and acts on this data, it follows the injected instructions instead of the application developer's original instructions.
The Open Worldwide Application Security Project (OWASP) ranked prompt injection as LLM01:2025, the top security vulnerability for large language model applications [19]. Prompt injection is particularly dangerous in agentic systems where LLMs have access to tools, APIs, and databases, because an injected instruction could cause the model to exfiltrate data, send unauthorized messages, or take other harmful actions.
Google published a detailed analysis of their experience defending Gemini against indirect prompt injections in 2025, highlighting the difficulty of distinguishing between legitimate instructions and adversarial ones when the model must process untrusted content [20].
The research community has proposed numerous defenses against adversarial attacks, though none provides complete protection. Defenses generally fall into four categories.
Adversarial training is the most well-established empirical defense. The idea is straightforward: during training, generate adversarial examples on the fly and include them in the training set so the model learns to classify them correctly. Madry et al. (2017) showed that training with PGD adversarial examples produces models that are robust to a wide range of first-order attacks [6]. This approach has been refined over the years with techniques such as TRADES (which explicitly balances clean and robust accuracy), AWP (adversarial weight perturbation), and various data augmentation strategies.
Adversarial training remains the gold standard, but it has significant costs. Training with adversarial examples requires generating attacks at every training step, which can increase training time by a factor of 3 to 10. It also tends to reduce accuracy on clean (non-adversarial) inputs.
Certified defenses provide mathematical guarantees that a model's prediction will not change for any perturbation within a specified norm ball. Randomized smoothing is the most scalable certified defense: it constructs a smooth classifier by averaging predictions over Gaussian noise added to the input [21]. The resulting classifier comes with a provable guarantee on its prediction within an L2 ball. Other certified methods include interval bound propagation and abstract interpretation. While certified defenses provide strong theoretical guarantees, their certified radii tend to be small, and certified robust accuracy lags behind empirical defenses.
Preprocessing defenses transform inputs before they reach the model, aiming to destroy adversarial perturbations while preserving legitimate features. Techniques include JPEG compression, bit-depth reduction, spatial smoothing, and feature squeezing. A multi-layered defense combining adversarial training with feature squeezing improved resilience on brain tumor MRI classification from 32% (under FGSM attack, no defense) to 54% [7]. However, preprocessing defenses alone are generally insufficient against adaptive attackers who can account for the preprocessing step in their attack optimization.
Detection-based defenses attempt to identify adversarial inputs before they reach the model, rather than trying to classify them correctly. Methods include training a separate detector network, monitoring statistical properties of inputs or hidden layer activations, and using ensemble disagreement (where adversarial examples are more likely to cause different models to disagree). For LLMs, defenses like SmoothLLM apply random perturbations to the prompt and check whether the model's output remains consistent, while SafeDecoding modifies the token sampling process to favor safe completions [22].
One of the most important findings in adversarial robustness research is that there appears to be an inherent tension between accuracy on clean inputs and robustness to adversarial perturbations. Tsipras et al. (2019) provided theoretical and empirical evidence that robustness may require a fundamentally different set of features than standard accuracy [23]. Models that are robust to adversarial perturbations tend to learn features that align with human perception (such as shapes and textures), whereas standard models exploit high-frequency patterns that are predictive but imperceptible to humans.
The practical impact is stark. On the CIFAR-10 benchmark, standard models achieve over 95% accuracy. The best adversarially robust models on RobustBench, evaluated with AutoAttack under L-infinity perturbations (epsilon = 8/255), achieve around 70% robust accuracy while sacrificing several percentage points of clean accuracy [10]. The gap narrows but does not disappear as model capacity and training data increase. Recent work has explored methods to mitigate this tradeoff, including using additional unlabeled data, larger model architectures, and improved training recipes. One approach achieved 38.72% L-infinity AutoAttacked accuracy while improving clean accuracy by 10 percentage points relative to other robust models, suggesting that the tradeoff can be managed but not eliminated.
The robustness-accuracy tradeoff also affects secondary properties of models. Robust models tend to be significantly underconfident in their predictions, which affects calibration, out-of-distribution detection, and other downstream tasks.
One of the most concerning properties of adversarial examples is their transferability: an adversarial example crafted to fool one model often fools other models as well, even if those models have different architectures and were trained on different data [2]. This property enables black-box attacks, where an attacker trains a local surrogate model, generates adversarial examples against it, and uses those examples to attack a target model they have no direct access to.
Transferability was demonstrated in the original Szegedy et al. paper and has been extensively studied since. The GCG attack on LLMs exploited transferability to attack closed-source commercial models: adversarial suffixes optimized on open-source Llama models transferred successfully to GPT-4 and Claude [16]. Factors that affect transferability include the similarity of model architectures, training procedures, and data distributions. Ensemble-based attack methods that optimize against multiple models simultaneously tend to produce more transferable adversarial examples.
Transferability has important security implications. It means that an attacker does not need access to the target model to mount an effective attack. Any defense strategy must account for the possibility that adversarial examples will be generated using entirely different models.
Adversarial attacks pose genuine risks in safety-critical and security-critical applications.
Autonomous driving systems rely on computer vision for perception tasks including object detection, traffic sign recognition, lane detection, and pedestrian identification. Adversarial attacks can target any of these components. Physical adversarial patches on road signs, adversarial patterns on vehicles or clothing, and even projected light patterns have been shown to fool perception systems [14]. Dynamic adversarial attacks can manipulate trajectory prediction modules, causing autonomous vehicles to make dangerous planning decisions (RSS 2024). Defenses for autonomous driving are an active research area, with recent work focusing on LiDAR-specific attacks and cross-task attack frameworks.
In medical imaging, adversarial attacks could cause diagnostic AI systems to miss tumors, misclassify lesions, or produce false positives. This is particularly dangerous because clinicians are increasingly relying on AI-assisted diagnosis. Research has shown that adversarial perturbations can reduce the accuracy of brain tumor classifiers from 96% to as low as 13% under PGD attack [7]. While such attacks would require the adversary to manipulate medical images before they reach the diagnostic system, the potential consequences are severe enough to warrant careful attention.
Facial recognition systems used for access control and surveillance are vulnerable to physical adversarial attacks. The adversarial glasses work demonstrated that an attacker could impersonate another person or evade detection using nothing more than specially printed eyewear [15]. Spam filters, malware detectors, and intrusion detection systems that rely on machine learning are also vulnerable to adversarial evasion, where malicious content is modified just enough to bypass automated detection.
As machine learning becomes more deeply integrated into cybersecurity infrastructure (including intrusion detection systems, anomaly detectors, and threat classifiers), adversarial attacks become a tool for circumventing automated defenses. Research on the IoT-23 dataset found that convolutional neural networks used for network traffic classification are especially vulnerable to FGSM and PGD attacks, while simpler models like decision trees showed more robustness [24].
As of early 2026, adversarial robustness remains an unsolved problem. Several trends define the current landscape.
For vision models, RobustBench continues to serve as the primary benchmark. The best models on the CIFAR-10 leaderboard under L-infinity threat (epsilon = 8/255) achieve robust accuracies around 70%, up from roughly 60% a few years ago, but still far from the 95%+ clean accuracy of standard models. Progress has been driven by larger architectures (especially Vision Transformers), more training data (including synthetic data from diffusion models), and refined adversarial training recipes.
For large language models, the arms race between jailbreaks and defenses has intensified. The GCG attack and its variants remain the standard white-box benchmark. Multiple defenses have been proposed, but a 2025 study demonstrated that stronger adaptive attacks can bypass most of them, following a pattern similar to what happened in the vision domain years earlier [25]. OWASP's designation of prompt injection as the top LLM vulnerability reflects the seriousness with which the security community views these threats.
Certified robustness has seen incremental progress but remains limited in scale. Randomized smoothing can certify robustness on ImageNet-scale problems, but certified radii remain small relative to the perturbation budgets used in empirical evaluations.
Research directions that are actively being explored include: adversarial robustness of multimodal models (which combine vision and language and present new attack surfaces), robustness of models in agentic settings (where LLMs take actions in the world), scalable certified defenses, and the relationship between adversarial robustness and other desirable properties like fairness, privacy, and interpretability.
The fundamental lesson of adversarial attack research is that machine learning models do not perceive the world the way humans do. They rely on statistical patterns that can be manipulated in ways that are invisible or incomprehensible to us. Until this gap is closed, or until reliable defenses are developed, adversarial vulnerability will remain a core challenge for the deployment of AI in high-stakes applications.