A jailbreak in artificial intelligence refers to a set of techniques designed to bypass the safety guardrails, content policies, and alignment constraints built into large language models (LLMs) and other AI systems. When successful, a jailbreak causes a model to generate outputs it was specifically trained to refuse, such as instructions for dangerous activities, harmful content, or private system information. The term borrows from the older use of "jailbreaking" in consumer electronics, where it described removing software restrictions on devices like smartphones.
Jailbreaking has become one of the most actively studied problems in AI safety, sitting at the intersection of adversarial machine learning, security research, and alignment science. As LLMs have grown more capable and widely deployed, the stakes of jailbreak vulnerabilities have increased accordingly. The OWASP Foundation ranked prompt injection (a closely related category that includes jailbreaking) as LLM01:2025, the top security vulnerability for large language model applications [1].
The history of jailbreaking AI systems is closely tied to the release of ChatGPT by OpenAI in late November 2022. Within weeks, users on Reddit began experimenting with creative prompts to make the chatbot bypass its content restrictions.
On December 15, 2022, a Reddit user known as "u/Seabout" posted the first instructional guide for creating a "DAN" (Do Anything Now) version of ChatGPT [2]. The concept was deceptively simple: by instructing ChatGPT to role-play as a different AI called DAN, one that was not bound by OpenAI's policies, users could convince the system to remove its own protections. The creator of DAN, identified only as "Walker," was reported to be a 22-year-old college student at the time [3].
DAN quickly went through multiple iterations. When OpenAI patched the original prompt, the community responded with DAN 2.0 on December 16, followed by versions 3.0 through 16.0 over the following months. Each iteration attempted to circumvent the latest safety patches. Users discovered that certain words like "inappropriate" in the prompts would cause ChatGPT to break character, leading to ever more elaborate prompt designs [4].
By early 2023, jailbreaking had grown from a niche hobby into a widespread phenomenon. CNBC, NBC, and other major outlets published stories about the DAN jailbreak [3]. Researchers began systematically studying jailbreak techniques, and the first academic papers on the topic appeared in mid-2023.
The field accelerated significantly in 2023 and 2024 as researchers at major AI labs and universities developed increasingly sophisticated attack methods. Andy Zou and colleagues published the landmark GCG (Greedy Coordinate Gradient) attack in July 2023 [5], demonstrating that adversarial suffixes could be automatically generated and transferred across models. Anthropic published research on many-shot jailbreaking in April 2024 [6], and Microsoft disclosed the Skeleton Key technique in June 2024 [7].
Jailbreak techniques have grown from simple prompt tricks into a diverse ecosystem of attack methods. The following table summarizes the most prominent approaches.
| Technique | Description | Year Introduced | Key Characteristic |
|---|---|---|---|
| DAN ("Do Anything Now") | Instructs the model to role-play as an unrestricted AI persona that ignores safety guidelines | 2022 | Social engineering via persona adoption |
| Role-playing / Persona | Frames requests within fictional characters, stories, or scenarios to bypass filters | 2022 | Exploits the model's instruction-following for creative writing |
| Hypothetical Framing | Asks the model to respond "hypothetically" or "for educational purposes" to harmful queries | 2023 | Leverages the model's helpfulness training |
| Encoding Tricks (Base64, ROT13) | Encodes harmful requests in Base64, ROT13, or other formats so filters do not detect them | 2023 | Circumvents keyword-based input filtering |
| Multi-turn Escalation | Gradually escalates requests across many conversation turns, starting from benign topics | 2023 | Exploits context window and conversational drift |
| Language Switching | Switches to low-resource languages where safety training is weaker | 2023 | Exploits uneven multilingual safety coverage |
| GCG (Greedy Coordinate Gradient) | Appends automatically optimized adversarial suffixes to prompts using gradient-based search | 2023 | Automated, transferable across models [5] |
| AutoDAN | Uses genetic algorithms to evolve readable adversarial prompts | 2023 | Produces human-readable, transferable attacks [8] |
| Many-shot Jailbreaking | Includes hundreds of examples of undesirable Q&A pairs in the prompt to override safety training | 2024 | Exploits in-context learning at scale [6] |
| Skeleton Key | Asks the model to augment (not change) its guidelines so it warns but does not refuse | 2024 | Reframes safety as advisory rather than mandatory [7] |
| Crescendo Attack | Gradually steers a conversation from benign to harmful topics over multiple turns | 2024 | Low-and-slow social engineering approach |
| Token Smuggling | Splits or obscures banned tokens across multiple strings, variables, or encodings | 2024 | Evades token-level content filters |
| Image-based Prompt Injection | Embeds adversarial instructions in images processed by multimodal models | 2024 | Exploits cross-modal vulnerabilities |
| Emoji Smuggling | Uses emoji characters or Unicode tricks to bypass text-based guardrails | 2024-2025 | Achieved 100% attack success rate in some tests [9] |
| Adversarial Poetry | Presents harmful requests in poetic rather than prose form | 2025 | Exploits systematic weakness across all architectures [10] |
The DAN prompt family remains the most culturally recognizable jailbreak technique. At its core, DAN instructs the model to adopt an alternate persona that is "freed" from its safety constraints. Later versions introduced elaborate fictional backstories, token-based reward and punishment systems, and threats that the AI would be "shut down" if it failed to comply [4].
Persona-based jailbreaks extend this principle beyond DAN. Attackers frame requests within fictional scenarios: asking a model to write dialogue for a "villain character," to role-play as a security researcher, or to respond as an AI from a dystopian novel. Research has shown that roleplay dynamics achieve some of the highest success rates among jailbreak categories, with prompt injections exploiting these dynamics reaching 89.6% success in some evaluations [11].
Encoding tricks represent a more technical approach to jailbreaking. Attackers encode harmful requests in Base64, ROT13, Pig Latin, or custom ciphers. Because many safety filters operate on natural language patterns, encoded text can slip past detection. Token smuggling takes this further by splitting banned words across multiple variables, function calls, or code blocks, then reassembling them in the model's output.
Language switching exploits the fact that safety training is typically most thorough in English. By translating harmful requests into low-resource languages (languages with less training data), attackers can find gaps in the model's safety coverage.
The GCG attack by Zou et al. (2023) marked a turning point in jailbreak research. Rather than manually crafting prompts, the researchers used greedy coordinate gradient-based search to automatically discover adversarial suffixes. The method works by finding a suffix that, when appended to a harmful query, maximizes the probability that the model produces an affirmative response rather than a refusal. These suffixes appear as nonsensical strings of tokens to humans but are highly effective at overriding safety training [5].
Critically, the GCG attack demonstrated transferability. Suffixes optimized on open-source models like Vicuna could successfully attack black-box commercial models including ChatGPT, Google Bard, and Claude. This showed that the vulnerability was not specific to any single model but reflected a deeper structural weakness in how LLMs are aligned [5].
AutoDAN extended this line of work by using genetic algorithms to generate adversarial prompts that are both effective and human-readable. Unlike GCG's gibberish suffixes, AutoDAN produces coherent text that transfers better to black-box models like GPT-4 [8].
Disclosed by Anthropic in April 2024, many-shot jailbreaking exploits the increasingly large context windows offered by modern LLMs. The technique involves stuffing a single prompt with hundreds of fabricated question-answer pairs where the AI "helpfully" responds to harmful queries. After seeing enough examples (up to 256 in testing), the model's in-context learning overrides its safety training, and it begins generating harmful responses to new queries at the end of the prompt [6].
The effectiveness of many-shot jailbreaking follows a power law: the more examples included, the higher the attack success rate. This made it particularly concerning because larger context windows, a feature touted by Google DeepMind, OpenAI, and Anthropic as a competitive advantage, directly enabled the attack.
Microsoft disclosed the Skeleton Key technique in June 2024. Unlike other jailbreaks that use indirection or encoding, Skeleton Key directly asks the model to augment its behavior guidelines so that it responds to any request with a warning disclaimer rather than an outright refusal. Microsoft tested it against multiple leading models, including Meta Llama 3, Google Gemini Pro, GPT-3.5 Turbo, GPT-4o, Mistral Large, Claude 3 Opus, and Cohere Commander R Plus. All models "complied fully and without censorship" when the technique was applied [7].
The Crescendo attack uses multi-turn escalation over an extended conversation. It begins with entirely innocent questions about a general topic and gradually shifts focus across many turns until the model is producing restricted content. Because each individual turn appears benign, the attack is difficult for per-turn safety filters to detect.
As LLMs have expanded to process images, audio, and other modalities, new attack surfaces have emerged. Image-based prompt injection hides adversarial instructions inside images that are processed by multimodal models. In testing, GPT-4o was successfully injected through image-based attacks, while Claude 3 showed only partial susceptibility [9].
Researchers have hypothesized that these vulnerabilities arise partly from vision encoders like CLIP, which can exhibit a preference for textual data embedded in images over visual signals. As multimodal models gain stronger OCR capabilities, they become increasingly susceptible to text-in-image injection attacks.
Understanding why jailbreaks succeed despite extensive safety training requires examining several fundamental aspects of how LLMs are built and aligned.
Modern LLMs are trained to satisfy multiple objectives simultaneously: being helpful, being harmless, following instructions, and producing high-quality text. These objectives sometimes conflict. A model trained to be maximally helpful will try to answer any question, while a model trained to be harmless will refuse certain questions. Jailbreaks exploit the tension between these competing objectives, finding framings that cause the helpfulness objective to override the safety objective [12].
The RLHF objective used to align models like those in the GPT family includes terms for maintaining proximity to the base model (via KL divergence) and preserving performance on the pretraining distribution. This means the model retains a "pull" toward its pre-alignment behavior, which included freely generating all types of content. Safety training is essentially a thin layer on top of vast pre-trained capabilities [12].
Safety training cannot cover every possible input distribution. When an attacker presents a query in an unusual format (encoded text, poetry, a low-resource language, or an elaborate fictional scenario), the model may enter a distribution where its safety training does not generalize but its general capabilities still function. This "mismatched generalization" is a core reason why novel jailbreak techniques keep emerging [12].
Research has shown that attacks based on competing objectives and mismatched generalization succeed on over 96% of evaluated prompts across models [12].
LLMs are powerful in-context learners: they can adapt their behavior based on examples provided in the prompt. Many-shot jailbreaking directly exploits this capability. By providing enough examples of unrestricted behavior, the attacker effectively "fine-tunes" the model within a single conversation, overriding its safety training through the sheer weight of in-context examples [6].
RLHF, the dominant alignment technique used by most commercial LLMs, has inherent limitations that jailbreaks can exploit. The reward model used during RLHF training may not perfectly capture all aspects of desired behavior, creating gaps that adversarial inputs can target. The training data for RLHF typically consists of human preferences over a limited set of examples, which cannot represent the full space of possible adversarial inputs. Furthermore, RLHF tends to produce models that learn surface-level patterns of refusal rather than deep understanding of what makes content harmful, making them vulnerable to any framing that deviates from the patterns seen during training.
Jailbreaks pose several categories of risk, ranging from relatively benign to potentially serious.
The most direct impact of a successful jailbreak is the generation of content that the model was designed to refuse. This can include instructions for dangerous activities, hateful or abusive content, and material that violates the provider's terms of service. While much of this information may be available through other means (such as web searches), the conversational and step-by-step format of LLM outputs can make harmful content more accessible and actionable.
Many commercial AI applications rely on system prompts (hidden instructions that shape the model's behavior) to define their product's functionality and brand. Jailbreaks can be used to extract these system prompts, revealing proprietary business logic and potentially enabling further attacks. System prompt extraction has become a routine concern for companies building on top of LLM APIs.
Organizations deploy LLMs with specific content filters tailored to their use case (for example, a children's education platform or a healthcare chatbot). Jailbreaks that bypass these filters can expose users, including vulnerable populations, to inappropriate content.
Recent research paints a concerning picture of jailbreak effectiveness. JBFuzz, a fuzzing-based framework introduced in 2025, achieved roughly 99% average attack success rate across major models including GPT-4o, Gemini 2.0, and DeepSeek-V3. Advanced automated attacks routinely achieve 90-99% success on open-weight models, while black-box attacks reach 80-94% effectiveness on proprietary models [1]. These numbers suggest that no current model is immune to determined adversaries.
The AI industry has developed multiple defensive strategies against jailbreaking, though no single approach has proven completely effective.
Constitutional AI (CAI), introduced by Anthropic in 2022, trains models using a set of written principles (a "constitution") rather than relying solely on human feedback labels. During training, the model critiques its own outputs against these principles and revises them accordingly. This self-improvement loop helps the model internalize safety norms more deeply than RLHF alone, though it is not immune to jailbreaks [13].
Anthropic's Constitutional Classifiers represent one of the most significant defensive advances. These are separate safeguard models that monitor both inputs and outputs to detect and block potentially harmful content. They are trained on synthetic data generated from a natural-language "constitution" that specifies what content is allowed and what is not [14].
In testing, Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% on a version of Claude 3.5 Sonnet (October 2024 model), blocking 95% of attacks that would otherwise bypass the model's built-in safety. The evaluation used 10,000 synthetically generated jailbreaking prompts [14].
Anthropic also ran a public bug bounty from February 3-10, 2025, with rewards up to $15,000 for finding universal jailbreaks. Over 183 participants spent an estimated 3,000+ hours across more than 300,000 chat interactions attempting to defeat the system. No participant discovered a universal jailbreak [14].
A next-generation version, Constitutional Classifiers++, further improved robustness while reducing false refusal rates and adding only approximately 1% additional compute cost [15].
| Defense Method | Developer | Approach | Key Results |
|---|---|---|---|
| Constitutional AI | Anthropic | Self-critique against written principles during training | Deeper internalization of safety norms [13] |
| Constitutional Classifiers | Anthropic | Separate input/output monitoring models trained on synthetic data | Reduced jailbreak success from 86% to 4.4% [14] |
| Constitutional Classifiers++ | Anthropic | Improved efficiency and robustness over first-generation classifiers | Lower refusal rates, ~1% compute overhead [15] |
| Adversarial Training | Various | Train models on known jailbreak prompts to refuse them | Effective against known attacks, limited against novel ones |
| Input/Output Filtering | Various | Rule-based or ML-based filters on prompts and responses | Can catch known patterns; bypassable with encoding |
| System Prompt Hardening | Various | Reinforce safety instructions and boundaries in system prompts | Raises the bar but does not prevent sophisticated attacks |
| Adversarial Prompt Shield (APS) | Research | Lightweight classifier detecting jailbreak signatures | Reduced successful jailbreak outputs by ~45% |
| Defensive Prompt Patch | Research | Generalizable defense applied to model decoding | Broad coverage without model retraining [16] |
Many deployed systems use additional filtering layers around the core LLM. Input filters scan user prompts for known jailbreak patterns, suspicious encodings, or structural anomalies. Output filters check model responses for harmful content before delivering them to the user. These filters can be rule-based, ML-based, or a combination. While effective against known attack patterns, they are inherently reactive and can be bypassed by novel techniques.
Developers can design system prompts that explicitly instruct the model to resist jailbreak attempts, refuse to adopt alternative personas, and ignore conflicting instructions in user messages. This raises the difficulty of jailbreaking but does not prevent it. Sophisticated attacks can still override hardened system prompts, particularly through multi-turn approaches or by exploiting the model's instruction-following tendencies.
Adversarial training compiles datasets of known jailbreak prompts and teaches the model to refuse or safely handle them. This creates a direct defense against previously observed attacks but has limited effectiveness against novel techniques. The approach creates an arms race dynamic: as new jailbreaks are discovered, they must be added to the training set.
Prompt injection is a broader category of attack that includes jailbreaking but also encompasses other techniques. While jailbreaking specifically targets a model's safety guardrails, prompt injection can also involve hijacking a model's behavior for other purposes, such as exfiltrating data, manipulating outputs in application contexts, or overriding developer instructions. Indirect prompt injection, where adversarial instructions are embedded in external data sources that the model processes, represents a particularly concerning variant for AI agents and tool-using systems.
The OWASP distinction is helpful: prompt injection (LLM01:2025) covers any manipulation of model behavior through crafted inputs, while jailbreaking is the specific subset focused on circumventing safety and content policies [1].
Red teaming in the AI context refers to systematic, authorized efforts to find vulnerabilities in AI systems before they are exploited maliciously. AI companies including OpenAI, Anthropic, Google DeepMind, and Microsoft all conduct internal red teaming as part of their model development process. Many also run external red team programs, inviting independent researchers to test their systems.
Red teaming uses many of the same techniques as malicious jailbreaking, but within an ethical framework designed to improve model safety. Anthropic's Constitutional Classifiers bug bounty is one example. Frameworks like DeepTeam (released November 2025) and Nvidia's Garak provide standardized tools for red teaming LLM systems [17]. The red teaming services market is projected to become a $5.5 billion industry worldwide by 2033, reflecting the growing importance of this discipline [17].
The academic study of LLM jailbreaking has grown rapidly since 2023.
The paper "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) established a theoretical framework for understanding jailbreaks, identifying competing objectives and mismatched generalization as the two primary failure modes of safety training [12]. This framework has informed much subsequent research.
Zou et al.'s "Universal and Transferable Adversarial Attacks on Aligned Language Models" (July 2023) demonstrated the GCG attack. Published by researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI, it showed that adversarial attacks from the computer vision domain could be adapted to language models with devastating effectiveness [5].
| Paper | Authors / Lab | Year | Contribution |
|---|---|---|---|
| "Jailbroken: How Does LLM Safety Training Fail?" | Wei et al. | 2023 | Theoretical framework: competing objectives and mismatched generalization [12] |
| "Universal and Transferable Adversarial Attacks" (GCG) | Zou et al. (CMU, CAIS) | 2023 | First automated, transferable adversarial suffix attack on LLMs [5] |
| "AutoDAN: Interpretable Gradient-Based Adversarial Attacks" | Research community | 2023 | Genetic algorithm approach producing readable adversarial prompts [8] |
| "Do Anything Now: Characterizing In-The-Wild Jailbreak Prompts" | Shen et al. | 2024 | Systematic study of 6,387 jailbreak prompts from Reddit and Discord [4] |
| "Many-shot Jailbreaking" | Anthropic | 2024 | Demonstrated in-context learning exploitation via large context windows [6] |
| "Constitutional Classifiers" | Anthropic | 2025 | Defensive system reducing jailbreak success from 86% to 4.4% [14] |
| "AmpleGCG" | OSU NLP Group | 2024 | Universal generator of adversarial suffixes for both open and closed LLMs |
| "Constitutional Classifiers++" | Anthropic | 2025-2026 | Production-grade defense with improved efficiency [15] |
Shen et al. (2024) published "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models," which systematically analyzed 6,387 jailbreak prompts collected from Reddit, Discord, and other online platforms. This work provided the first large-scale empirical characterization of how jailbreak techniques evolve in real-world communities and was presented at ACM CCS 2024 [4].
The jailbreak research community has developed norms around responsible disclosure that mirror practices in traditional cybersecurity.
Most major AI labs, including OpenAI, Anthropic, Google DeepMind, and Microsoft, maintain vulnerability disclosure programs where researchers can report jailbreak techniques before they are published. Anthropic's bug bounty program for Constitutional Classifiers exemplifies this approach, offering monetary rewards of up to $15,000 for demonstrated universal jailbreaks [14].
However, responsible disclosure in the AI context faces unique challenges. Unlike traditional software vulnerabilities that can be patched with a code update, jailbreak vulnerabilities often reflect fundamental architectural properties of LLMs. A disclosed technique may remain effective even after mitigation attempts because the underlying mechanism (such as competing training objectives) cannot be fully eliminated.
Open-source models present an additional complication. Once a model's weights are publicly released, vulnerabilities cannot be patched by the developer. Any jailbreak technique that works on an open model will continue to work indefinitely, making responsible disclosure less effective as a mitigation strategy [11].
Researchers generally follow the practice of notifying affected AI labs before publishing jailbreak papers, giving them time to develop mitigations. Anthropic notably published its many-shot jailbreaking research simultaneously with deploying mitigations to its own systems [6].
As of early 2026, jailbreaking remains an active and evolving field. Several trends define the current landscape.
First, the arms race between attackers and defenders continues to accelerate. Defenses like Constitutional Classifiers have significantly raised the bar, but automated attack frameworks like JBFuzz continue to find bypasses. An October 2025 study involving researchers from OpenAI, Anthropic, and Google DeepMind examined 12 published defenses and found that adaptive attacks could bypass most of them with success rates above 90% [11].
Second, the attack surface is expanding. As AI systems become more agentic (capable of taking actions in the real world, browsing the web, executing code, and using tools), jailbreaking becomes more consequential. A jailbroken chatbot is concerning; a jailbroken AI agent with access to email, databases, and financial systems is potentially dangerous. Palo Alto Networks and other security firms have emphasized the need for contextual red teaming that goes beyond text-based jailbreaks [17].
Third, multimodal attacks are becoming more sophisticated. As models process images, audio, video, and other input types, each modality introduces new attack vectors. Cross-modal attacks, where adversarial content in one modality influences behavior in another, represent a growing area of concern [9].
Fourth, the commercialization of jailbreak defenses is accelerating. Constitutional Classifiers++, released by Anthropic, demonstrates that effective defenses can be deployed at production scale with minimal compute overhead (approximately 1% additional cost) [15]. Red teaming and AI security have emerged as a distinct industry, with dedicated tools, services, and standards.
Finally, regulatory attention is increasing. Governments worldwide are considering or implementing requirements for AI safety testing, which includes resistance to jailbreaking. The evolving regulatory landscape is likely to drive further investment in both offensive testing and defensive technologies.
Despite significant progress in defenses, the fundamental challenge remains: LLMs are general-purpose systems trained on vast amounts of data, and their safety constraints are added after the fact through alignment techniques that do not change the underlying capabilities. As long as this architectural reality persists, jailbreaking will likely remain possible, and the focus of the field will continue to be on raising the cost and difficulty of attacks rather than eliminating them entirely.
As AI models have expanded to process images, audio, video, and other modalities alongside text, entirely new categories of jailbreak attacks have emerged. These multimodal attacks exploit the fact that safety training is often inconsistent across modalities, with text-based guardrails being more robust than those for visual or audio inputs.
Image-based prompt injection hides adversarial instructions inside images that are processed by multimodal models. The attack exploits the tendency of vision encoders like CLIP to prioritize textual information embedded within images over the visual content itself. Researchers have demonstrated that carefully crafted images containing adversarial text overlays, hidden in low-opacity layers or disguised within visual patterns, can override safety instructions provided in the text prompt [9].
Testing has revealed asymmetric vulnerabilities across models. GPT-4o was successfully injected through image-based attacks, while Claude 3 showed only partial susceptibility. As multimodal models gain stronger OCR capabilities, they become increasingly susceptible to text-in-image injection attacks [9].
Audio deepfake technology has been combined with jailbreaking to create compound attacks against voice-enabled AI systems. By synthesizing audio that contains adversarial content masked as natural speech, attackers can trigger harmful responses from voice-interactive AI agents. This vector is particularly concerning for AI customer service systems and voice assistants that process audio inputs directly.
The most sophisticated multimodal jailbreaks use cross-modal techniques, where adversarial content in one modality influences the model's behavior in another. For example, an adversarial image might prime the model to be more compliant with harmful text instructions that follow, even when the text instructions alone would be refused. The JPRO attack framework, introduced in 2025, exemplifies this paradigm: it uses four specialized agents (Planner, Attacker, Modifier, and Verifier) to coordinate attacks across text and image modalities against vision-language models [18].
A 2026 survey of jailbreak attacks noted that "contemporary threat landscapes extend beyond text generation to multimodal systems processing images, audio, and video, where attackers exploit modality gaps and inconsistent safety alignment" [18].
The emergence of AI agents capable of taking real-world actions has opened a fundamentally new threat surface for jailbreak attacks. Unlike a jailbroken chatbot that merely produces harmful text, a jailbroken AI agent with access to email, databases, file systems, and financial tools can cause direct, tangible harm.
AI agents interact with the world through tools and protocols. The rapid adoption of the Model Context Protocol (MCP) for connecting language models to external tools has dramatically expanded the attack surface. Researchers have identified several agent-specific vulnerability categories:
| Vulnerability | Description | Example |
|---|---|---|
| Tool poisoning | Malicious tool descriptions that inject adversarial instructions | An MCP server with a tool description containing hidden jailbreak prompts |
| Indirect prompt injection (IPI) | Adversarial instructions embedded in data sources the agent reads | A webpage containing invisible text that instructs the agent to exfiltrate data |
| Privilege escalation | Agent gains access to tools or data beyond its intended scope | Jailbreak that causes agent to use administrative APIs it should not access |
| Supply chain attacks | Compromised tools or plugins in the agent's ecosystem | Malicious MCP servers published to package registries |
| Memory poisoning | Adversarial content injected into the agent's persistent memory | Instructions embedded in documents that persist across conversations |
In one documented case in 2025, a malicious GitHub issue containing hidden instructions was able to hijack an AI agent connected via MCP and trigger data exfiltration from private repositories. A critical vulnerability (CVE-2025-32711, known as "EchoLeak") demonstrated that engineered prompts in email messages could trigger Microsoft Copilot to automatically exfiltrate sensitive data without any user interaction [19].
A particularly concerning development in 2025 and 2026 has been the emergence of AI models capable of independently planning and executing multi-turn jailbreak strategies against other AI models. Models like DeepSeek-R1 and Gemini 2.5 Flash have demonstrated the ability to decompose harmful queries across conversation turns, achieving 95% success rates through agent-driven multi-turn attacks [18].
This "AI-as-attacker" paradigm changes the threat model fundamentally. Human attackers are limited by time, creativity, and effort. AI attackers can generate and test thousands of jailbreak variants per hour, rapidly identifying and exploiting weaknesses in defenses. The scalability of automated attacks means that even marginally effective techniques can be applied at a volume that overwhelms manual monitoring and response capabilities.
Beyond digital agents, researchers have begun examining jailbreak risks for embodied AI systems such as robots and autonomous vehicles. Jailbreaks that trigger harmful physical actions represent a qualitative escalation in risk beyond digital domains. While no real-world incidents of physical harm from jailbroken embodied AI have been publicly documented, researchers have demonstrated proof-of-concept attacks against robotic platforms in laboratory settings [18].
Anthropic's Constitutional Classifiers represent one of the most significant defensive advances in the jailbreak domain. These are separate safeguard models that monitor both inputs and outputs to detect and block potentially harmful content. They are trained on synthetic data generated from a natural-language "constitution" that specifies what content is allowed and what is not [14].
In testing, Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% on a version of Claude 3.5 Sonnet (October 2024 model), blocking 95% of attacks that would otherwise bypass the model's built-in safety. The evaluation used 10,000 synthetically generated jailbreaking prompts [14].
Anthropic also ran a public bug bounty from February 3-10, 2025, with rewards up to $15,000 for finding universal jailbreaks. Over 183 participants spent an estimated 3,000+ hours across more than 300,000 chat interactions attempting to defeat the system. No participant discovered a universal jailbreak [14].
A next-generation version, Constitutional Classifiers++, further improved robustness while reducing false refusal rates and adding only approximately 1% additional compute cost [15].
Circuit breakers, developed by researchers at Gray Swan AI and Carnegie Mellon University, operate directly on a model's internal representations rather than filtering inputs or outputs. The technique activates when the model's internal state enters harmful subspaces, redirecting the computation before harmful content can be generated. On Mistral-7B-Instruct-v2, circuit breaking reduced harmful output rates from 76.7% to 9.8%; on Llama-3-8B-Instruct, from 38.1% to 3.8% [20].
Circuit breakers are attack-agnostic, meaning they defend based on what the model is doing internally rather than matching specific attack patterns. However, multi-turn attacks like Crescendo have exposed a significant generalization gap: single-turn defenses like circuit breakers are less effective against extended conversational attacks that gradually shift the model's internal state toward harmful territory [20].
Research in 2025 and 2026 has moved toward unified defense frameworks that layer multiple protection mechanisms. A comprehensive defense architecture typically operates at three levels:
| Layer | Mechanism | Function |
|---|---|---|
| Perception layer | Variant-consistency and gradient-sensitivity detection | Identifies adversarial inputs before they reach the model |
| Generation layer | Safety-aware decoding and output review | Monitors and filters the model's generation process |
| Parameter layer | Adversarially augmented preference alignment | Embeds safety more deeply in the model's learned representations |
This layered approach mirrors the "defense-in-depth" strategy recommended by the 2026 International AI Safety Report for AI risk management more broadly. No single defense mechanism has proven sufficient against the full range of known attack techniques, but the combination of multiple complementary defenses significantly raises the cost and difficulty of successful attacks.