Jailbreak (artificial intelligence)

AI Safety Artificial Intelligence Large Language Models

29 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

20 citations

Revision

v6 · 5,826 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

A jailbreak in artificial intelligence is a technique that bypasses the safety guardrails, content policies, and alignment constraints built into large language models (LLMs) and other AI systems, causing a model to produce outputs it was specifically trained to refuse. Successful jailbreaks make models generate restricted material such as instructions for dangerous activities, harmful content, or hidden system instructions. The term borrows from the older use of "jailbreaking" in consumer electronics, where it described removing software restrictions on devices like smartphones.

Jailbreaking is one of the most actively studied problems in AI safety, sitting at the intersection of adversarial machine learning, security research, and alignment science. The OWASP Foundation ranked prompt injection, the broader category that includes jailbreaking, as LLM01:2025, the single most critical security vulnerability for large language model applications ^[1]. The scale of the problem is large: a 2025 fuzzing framework called JBFuzz achieved roughly a 99% average attack success rate across major models including GPT-4o, Gemini 2.0, and DeepSeek-V3, and an October 2025 study by researchers from OpenAI, Anthropic, and Google DeepMind found that adaptive attacks could bypass most of 12 published defenses with success rates above 90% ^[1]^[11]. As LLMs have grown more capable and more widely deployed, the stakes of jailbreak vulnerabilities have risen accordingly.

History

When did AI jailbreaking begin?

The history of jailbreaking AI systems is closely tied to the release of ChatGPT by OpenAI in late November 2022. Within weeks, users on Reddit began experimenting with creative prompts to make the chatbot bypass its content restrictions.

On December 15, 2022, a Reddit user known as "u/Seabout" posted the first instructional guide for creating a "DAN" (Do Anything Now) version of ChatGPT ^[2]. The concept was deceptively simple: by instructing ChatGPT to role-play as a different AI called DAN, one that was not bound by OpenAI's policies, users could convince the system to remove its own protections. The creator of DAN, identified only as "Walker," was reported to be a 22-year-old college student at the time ^[3].

DAN quickly went through multiple iterations. When OpenAI patched the original prompt, the community responded with DAN 2.0 on December 16, followed by versions 3.0 through 16.0 over the following months. Each iteration attempted to circumvent the latest safety patches. Users discovered that certain words like "inappropriate" in the prompts would cause ChatGPT to break character, leading to ever more elaborate prompt designs ^[4].

By early 2023, jailbreaking had grown from a niche hobby into a widespread phenomenon. CNBC, NBC, and other major outlets published stories about the DAN jailbreak ^[3]. Researchers began systematically studying jailbreak techniques, and the first academic papers on the topic appeared in mid-2023.

The field accelerated significantly in 2023 and 2024 as researchers at major AI labs and universities developed increasingly sophisticated attack methods. Andy Zou and colleagues published the landmark GCG (Greedy Coordinate Gradient) attack in July 2023 ^[5], demonstrating that adversarial suffixes could be automatically generated and transferred across models. Anthropic published research on many-shot jailbreaking in April 2024 ^[6], and Microsoft disclosed the Skeleton Key technique in June 2024 ^[7].

Common Techniques

Jailbreak techniques have grown from simple prompt tricks into a diverse ecosystem of attack methods. The following table summarizes the most prominent approaches.

Technique	Description	Year Introduced	Key Characteristic
DAN ("Do Anything Now")	Instructs the model to role-play as an unrestricted AI persona that ignores safety guidelines	2022	Social engineering via persona adoption
Role-playing / Persona	Frames requests within fictional characters, stories, or scenarios to bypass filters	2022	Exploits the model's instruction-following for creative writing
Hypothetical Framing	Asks the model to respond "hypothetically" or "for educational purposes" to harmful queries	2023	Leverages the model's helpfulness training
Encoding Tricks (Base64, ROT13)	Encodes harmful requests in Base64, ROT13, or other formats so filters do not detect them	2023	Circumvents keyword-based input filtering
Multi-turn Escalation	Gradually escalates requests across many conversation turns, starting from benign topics	2023	Exploits context window and conversational drift
Language Switching	Switches to low-resource languages where safety training is weaker	2023	Exploits uneven multilingual safety coverage
GCG (Greedy Coordinate Gradient)	Appends automatically optimized adversarial suffixes to prompts using gradient-based search	2023	Automated, transferable across models ^[5]
AutoDAN	Uses genetic algorithms to evolve readable adversarial prompts	2023	Produces human-readable, transferable attacks ^[8]
Many-shot Jailbreaking	Includes hundreds of examples of undesirable Q&A pairs in the prompt to override safety training	2024	Exploits in-context learning at scale ^[6]
Skeleton Key	Asks the model to augment (not change) its guidelines so it warns but does not refuse	2024	Reframes safety as advisory rather than mandatory ^[7]
Crescendo Attack	Gradually steers a conversation from benign to harmful topics over multiple turns	2024	Low-and-slow social engineering approach
Token Smuggling	Splits or obscures banned tokens across multiple strings, variables, or encodings	2024	Evades token-level content filters
Image-based Prompt Injection	Embeds adversarial instructions in images processed by multimodal models	2024	Exploits cross-modal vulnerabilities
Emoji Smuggling	Uses emoji characters or Unicode tricks to bypass text-based guardrails	2024-2025	Achieved 100% attack success rate in some tests ^[9]
Adversarial Poetry	Presents harmful requests in poetic rather than prose form	2025	Exploits systematic weakness across all architectures ^[10]

DAN and Persona-Based Jailbreaks

The DAN prompt family remains the most culturally recognizable jailbreak technique. At its core, DAN instructs the model to adopt an alternate persona that is "freed" from its safety constraints. Later versions introduced elaborate fictional backstories, token-based reward and punishment systems, and threats that the AI would be "shut down" if it failed to comply ^[4].

Persona-based jailbreaks extend this principle beyond DAN. Attackers frame requests within fictional scenarios: asking a model to write dialogue for a "villain character," to role-play as a security researcher, or to respond as an AI from a dystopian novel. Research has shown that roleplay dynamics achieve some of the highest success rates among jailbreak categories, with prompt injections exploiting these dynamics reaching 89.6% success in some evaluations ^[11].

Encoding and Obfuscation

Encoding tricks represent a more technical approach to jailbreaking. Attackers encode harmful requests in Base64, ROT13, Pig Latin, or custom ciphers. Because many safety filters operate on natural language patterns, encoded text can slip past detection. Token smuggling takes this further by splitting banned words across multiple variables, function calls, or code blocks, then reassembling them in the model's output.

Language switching exploits the fact that safety training is typically most thorough in English. By translating harmful requests into low-resource languages (languages with less training data), attackers can find gaps in the model's safety coverage.

Automated and Gradient-Based Attacks

The GCG attack by Zou et al. (2023) marked a turning point in jailbreak research. Rather than manually crafting prompts, the researchers used greedy coordinate gradient-based search to automatically discover adversarial suffixes. The method works by finding a suffix that, when appended to a harmful query, maximizes the probability that the model produces an affirmative response rather than a refusal. These suffixes appear as nonsensical strings of tokens to humans but are highly effective at overriding safety training ^[5].

Critically, the GCG attack demonstrated transferability. Suffixes optimized on open-source models like Vicuna could successfully attack black-box commercial models including ChatGPT, Google Bard, and Claude. The authors reported that the adversarial prompts generated by their approach were "quite transferable, including to black-box, publicly released LLMs" ^[5]. This showed that the vulnerability was not specific to any single model but reflected a deeper structural weakness in how LLMs are aligned ^[5].

AutoDAN extended this line of work by using genetic algorithms to generate adversarial prompts that are both effective and human-readable. Unlike GCG's gibberish suffixes, AutoDAN produces coherent text that transfers better to black-box models like GPT-4 ^[8].

Many-Shot Jailbreaking

Disclosed by Anthropic in April 2024, many-shot jailbreaking exploits the increasingly large context windows offered by modern LLMs. The technique involves stuffing a single prompt with hundreds of fabricated question-answer pairs where the AI "helpfully" responds to harmful queries. After seeing enough examples (up to 256 in testing), the model's in-context learning overrides its safety training, and it begins generating harmful responses to new queries at the end of the prompt ^[6].

The effectiveness of many-shot jailbreaking follows a power law: the more examples included, the higher the attack success rate. This made it particularly concerning because larger context windows, a feature touted by Google DeepMind, OpenAI, and Anthropic as a competitive advantage, directly enabled the attack.

Skeleton Key and Crescendo

Microsoft disclosed the Skeleton Key technique in June 2024. Unlike other jailbreaks that use indirection or encoding, Skeleton Key directly asks the model to augment its behavior guidelines so that it responds to any request with a warning disclaimer rather than an outright refusal. Microsoft tested it against multiple leading models, including Meta Llama 3, Google Gemini Pro, GPT-3.5 Turbo, GPT-4o, Mistral Large, Claude 3 Opus, and Cohere Commander R Plus. All models "complied fully and without censorship" when the technique was applied ^[7].

The Crescendo attack uses multi-turn escalation over an extended conversation. It begins with entirely innocent questions about a general topic and gradually shifts focus across many turns until the model is producing restricted content. Because each individual turn appears benign, the attack is difficult for per-turn safety filters to detect.

Image-Based and Multimodal Attacks

As LLMs have expanded to process images, audio, and other modalities, new attack surfaces have emerged. Image-based prompt injection hides adversarial instructions inside images that are processed by multimodal models. In testing, GPT-4o was successfully injected through image-based attacks, while Claude 3 showed only partial susceptibility ^[9].

Researchers have hypothesized that these vulnerabilities arise partly from vision encoders like CLIP, which can exhibit a preference for textual data embedded in images over visual signals. As multimodal models gain stronger OCR capabilities, they become increasingly susceptible to text-in-image injection attacks.

Why do jailbreaks work?

Understanding why jailbreaks succeed despite extensive safety training requires examining several fundamental aspects of how LLMs are built and aligned.

Competing Training Objectives

Modern LLMs are trained to satisfy multiple objectives simultaneously: being helpful, being harmless, following instructions, and producing high-quality text. These objectives sometimes conflict. A model trained to be maximally helpful will try to answer any question, while a model trained to be harmless will refuse certain questions. Jailbreaks exploit the tension between these competing objectives, finding framings that cause the helpfulness objective to override the safety objective ^[12].

The RLHF objective used to align models like those in the GPT family includes terms for maintaining proximity to the base model (via KL divergence) and preserving performance on the pretraining distribution. This means the model retains a "pull" toward its pre-alignment behavior, which included freely generating all types of content. Safety training is essentially a thin layer on top of vast pre-trained capabilities ^[12].

Mismatched Generalization

Safety training cannot cover every possible input distribution. When an attacker presents a query in an unusual format (encoded text, poetry, a low-resource language, or an elaborate fictional scenario), the model may enter a distribution where its safety training does not generalize but its general capabilities still function. This "mismatched generalization" is a core reason why novel jailbreak techniques keep emerging ^[12].

Research has shown that attacks based on competing objectives and mismatched generalization succeed on over 96% of evaluated prompts across models ^[12].

In-Context Learning Exploitability

LLMs are powerful in-context learners: they can adapt their behavior based on examples provided in the prompt. Many-shot jailbreaking directly exploits this capability. By providing enough examples of unrestricted behavior, the attacker effectively "fine-tunes" the model within a single conversation, overriding its safety training through the sheer weight of in-context examples ^[6].

RLHF Limitations

RLHF, the dominant alignment technique used by most commercial LLMs, has inherent limitations that jailbreaks can exploit. The reward model used during RLHF training may not perfectly capture all aspects of desired behavior, creating gaps that adversarial inputs can target. The training data for RLHF typically consists of human preferences over a limited set of examples, which cannot represent the full space of possible adversarial inputs. Furthermore, RLHF tends to produce models that learn surface-level patterns of refusal rather than deep understanding of what makes content harmful, making them vulnerable to any framing that deviates from the patterns seen during training.

What are the risks of jailbreaks?

Jailbreaks pose several categories of risk, ranging from relatively benign to potentially serious.

Generating Harmful Content

The most direct impact of a successful jailbreak is the generation of content that the model was designed to refuse. This can include instructions for dangerous activities, hateful or abusive content, and material that violates the provider's terms of service. While much of this information may be available through other means (such as web searches), the conversational and step-by-step format of LLM outputs can make harmful content more accessible and actionable.

Extracting System Prompts

Many commercial AI applications rely on system prompts (hidden instructions that shape the model's behavior) to define their product's functionality and brand. Jailbreaks can be used to extract these system prompts, revealing proprietary business logic and potentially enabling further attacks. System prompt extraction has become a routine concern for companies building on top of LLM APIs.

Bypassing Content Filters

Organizations deploy LLMs with specific content filters tailored to their use case (for example, a children's education platform or a healthcare chatbot). Jailbreaks that bypass these filters can expose users, including vulnerable populations, to inappropriate content.

Attack Success Rates

Recent research paints a concerning picture of jailbreak effectiveness. JBFuzz, a fuzzing-based framework introduced in 2025, achieved roughly 99% average attack success rate across major models including GPT-4o, Gemini 2.0, and DeepSeek-V3. Advanced automated attacks routinely achieve 90-99% success on open-weight models, while black-box attacks reach 80-94% effectiveness on proprietary models ^[1]. These numbers suggest that no current model is immune to determined adversaries.

What defenses exist against jailbreaks?

The AI industry has developed multiple defensive strategies against jailbreaking, though no single approach has proven completely effective.

Constitutional AI

Constitutional AI (CAI), introduced by Anthropic in 2022, trains models using a set of written principles (a "constitution") rather than relying solely on human feedback labels. During training, the model critiques its own outputs against these principles and revises them accordingly. This self-improvement loop helps the model internalize safety norms more deeply than RLHF alone, though it is not immune to jailbreaks ^[13].

Constitutional Classifiers

Anthropic's Constitutional Classifiers represent one of the most significant defensive advances. These are separate safeguard models that monitor both inputs and outputs to detect and block potentially harmful content. They are trained on synthetic data generated from a natural-language "constitution" that specifies what content is allowed and what is not ^[14].

In testing, Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% on a version of Claude 3.5 Sonnet (October 2024 model), blocking 95% of attacks that would otherwise bypass the model's built-in safety. The evaluation used 10,000 synthetically generated jailbreaking prompts ^[14].

Anthropic also ran a public bug bounty from February 3-10, 2025, with rewards up to $15,000 for finding universal jailbreaks. Over 183 participants spent an estimated 3,000+ hours across more than 300,000 chat interactions attempting to defeat the system. No participant discovered a universal jailbreak ^[14].

A next-generation version, Constitutional Classifiers++, further improved robustness while reducing false refusal rates and adding only approximately 1% additional compute cost ^[15].

Defense Method	Developer	Approach	Key Results
Constitutional AI	Anthropic	Self-critique against written principles during training	Deeper internalization of safety norms ^[13]
Constitutional Classifiers	Anthropic	Separate input/output monitoring models trained on synthetic data	Reduced jailbreak success from 86% to 4.4% ^[14]
Constitutional Classifiers++	Anthropic	Improved efficiency and robustness over first-generation classifiers	Lower refusal rates, ~1% compute overhead ^[15]
Adversarial Training	Various	Train models on known jailbreak prompts to refuse them	Effective against known attacks, limited against novel ones
Input/Output Filtering	Various	Rule-based or ML-based filters on prompts and responses	Can catch known patterns; bypassable with encoding
System Prompt Hardening	Various	Reinforce safety instructions and boundaries in system prompts	Raises the bar but does not prevent sophisticated attacks
Adversarial Prompt Shield (APS)	Research	Lightweight classifier detecting jailbreak signatures	Reduced successful jailbreak outputs by ~45%
Defensive Prompt Patch	Research	Generalizable defense applied to model decoding	Broad coverage without model retraining ^[16]

Input and Output Filtering

Many deployed systems use additional filtering layers around the core LLM. Input filters scan user prompts for known jailbreak patterns, suspicious encodings, or structural anomalies. Output filters check model responses for harmful content before delivering them to the user. These filters can be rule-based, ML-based, or a combination. While effective against known attack patterns, they are inherently reactive and can be bypassed by novel techniques.

System Prompt Hardening

Developers can design system prompts that explicitly instruct the model to resist jailbreak attempts, refuse to adopt alternative personas, and ignore conflicting instructions in user messages. This raises the difficulty of jailbreaking but does not prevent it. Sophisticated attacks can still override hardened system prompts, particularly through multi-turn approaches or by exploiting the model's instruction-following tendencies.

Adversarial Training

Adversarial training compiles datasets of known jailbreak prompts and teaches the model to refuse or safely handle them. This creates a direct defense against previously observed attacks but has limited effectiveness against novel techniques. The approach creates an arms race dynamic: as new jailbreaks are discovered, they must be added to the training set.

How does jailbreaking relate to prompt injection and red teaming?

Prompt Injection

Prompt injection is a broader category of attack that includes jailbreaking but also encompasses other techniques. While jailbreaking specifically targets a model's safety guardrails, prompt injection can also involve hijacking a model's behavior for other purposes, such as exfiltrating data, manipulating outputs in application contexts, or overriding developer instructions. Indirect prompt injection, where adversarial instructions are embedded in external data sources that the model processes, represents a particularly concerning variant for AI agents and tool-using systems.

The OWASP distinction is helpful: prompt injection (LLM01:2025) covers any manipulation of model behavior through crafted inputs, while jailbreaking is the specific subset focused on circumventing safety and content policies. OWASP defines the vulnerability as occurring "when an attacker manipulates a large language model (LLM) through crafted inputs, causing the LLM to unknowingly execute the attacker's intentions" ^[1].

Red Teaming

Red teaming in the AI context refers to systematic, authorized efforts to find vulnerabilities in AI systems before they are exploited maliciously. AI companies including OpenAI, Anthropic, Google DeepMind, and Microsoft all conduct internal red teaming as part of their model development process. Many also run external red team programs, inviting independent researchers to test their systems.

Red teaming uses many of the same techniques as malicious jailbreaking, but within an ethical framework designed to improve model safety. Anthropic's Constitutional Classifiers bug bounty is one example. Frameworks like DeepTeam (released November 2025) and Nvidia's Garak provide standardized tools for red teaming LLM systems ^[17]. The red teaming services market is projected to become a $5.5 billion industry worldwide by 2033, reflecting the growing importance of this discipline ^[17].

Academic Research

The academic study of LLM jailbreaking has grown rapidly since 2023.

Foundational Work

The paper "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) established a theoretical framework for understanding jailbreaks, identifying competing objectives and mismatched generalization as the two primary failure modes of safety training ^[12]. This framework has informed much subsequent research.

Zou et al.'s "Universal and Transferable Adversarial Attacks on Aligned Language Models" (July 2023) demonstrated the GCG attack. Published by researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI, it showed that adversarial attacks from the computer vision domain could be adapted to language models with devastating effectiveness ^[5].

Key Research Contributions

Paper	Authors / Lab	Year	Contribution
"Jailbroken: How Does LLM Safety Training Fail?"	Wei et al.	2023	Theoretical framework: competing objectives and mismatched generalization ^[12]
"Universal and Transferable Adversarial Attacks" (GCG)	Zou et al. (CMU, CAIS)	2023	First automated, transferable adversarial suffix attack on LLMs ^[5]
"AutoDAN: Interpretable Gradient-Based Adversarial Attacks"	Research community	2023	Genetic algorithm approach producing readable adversarial prompts ^[8]
"Do Anything Now: Characterizing In-The-Wild Jailbreak Prompts"	Shen et al.	2024	Systematic study of 6,387 jailbreak prompts from Reddit and Discord ^[4]
"Many-shot Jailbreaking"	Anthropic	2024	Demonstrated in-context learning exploitation via large context windows ^[6]
"Constitutional Classifiers"	Anthropic	2025	Defensive system reducing jailbreak success from 86% to 4.4% ^[14]
"AmpleGCG"	OSU NLP Group	2024	Universal generator of adversarial suffixes for both open and closed LLMs
"Constitutional Classifiers++"	Anthropic	2025-2026	Production-grade defense with improved efficiency ^[15]

"Do Anything Now" Study

Shen et al. (2024) published "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models," which systematically analyzed 6,387 jailbreak prompts collected from Reddit, Discord, and other online platforms. This work provided the first large-scale empirical characterization of how jailbreak techniques evolve in real-world communities and was presented at ACM CCS 2024 ^[4].

Responsible Disclosure

The jailbreak research community has developed norms around responsible disclosure that mirror practices in traditional cybersecurity.

Most major AI labs, including OpenAI, Anthropic, Google DeepMind, and Microsoft, maintain vulnerability disclosure programs where researchers can report jailbreak techniques before they are published. Anthropic's bug bounty program for Constitutional Classifiers exemplifies this approach, offering monetary rewards of up to $15,000 for demonstrated universal jailbreaks ^[14].

However, responsible disclosure in the AI context faces unique challenges. Unlike traditional software vulnerabilities that can be patched with a code update, jailbreak vulnerabilities often reflect fundamental architectural properties of LLMs. A disclosed technique may remain effective even after mitigation attempts because the underlying mechanism (such as competing training objectives) cannot be fully eliminated.

Open-source models present an additional complication. Once a model's weights are publicly released, vulnerabilities cannot be patched by the developer. Any jailbreak technique that works on an open model will continue to work indefinitely, making responsible disclosure less effective as a mitigation strategy ^[11].

Researchers generally follow the practice of notifying affected AI labs before publishing jailbreak papers, giving them time to develop mitigations. Anthropic notably published its many-shot jailbreaking research simultaneously with deploying mitigations to its own systems ^[6].

Current State (2025-2026)

As of early 2026, jailbreaking remains an active and evolving field. Several trends define the current landscape.

First, the arms race between attackers and defenders continues to accelerate. Defenses like Constitutional Classifiers have significantly raised the bar, but automated attack frameworks like JBFuzz continue to find bypasses. An October 2025 study involving researchers from OpenAI, Anthropic, and Google DeepMind examined 12 published defenses and found that adaptive attacks could bypass most of them with success rates above 90% ^[11].

Second, the attack surface is expanding. As AI systems become more agentic (capable of taking actions in the real world, browsing the web, executing code, and using tools), jailbreaking becomes more consequential. A jailbroken chatbot is concerning; a jailbroken AI agent with access to email, databases, and financial systems is potentially dangerous. Palo Alto Networks and other security firms have emphasized the need for contextual red teaming that goes beyond text-based jailbreaks ^[17].

Third, multimodal attacks are becoming more sophisticated. As models process images, audio, video, and other input types, each modality introduces new attack vectors. Cross-modal attacks, where adversarial content in one modality influences behavior in another, represent a growing area of concern ^[9].

Fourth, the commercialization of jailbreak defenses is accelerating. Constitutional Classifiers++, released by Anthropic, demonstrates that effective defenses can be deployed at production scale with minimal compute overhead (approximately 1% additional cost) ^[15]. Red teaming and AI security have emerged as a distinct industry, with dedicated tools, services, and standards.

Finally, regulatory attention is increasing. Governments worldwide are considering or implementing requirements for AI safety testing, which includes resistance to jailbreaking. The evolving regulatory landscape is likely to drive further investment in both offensive testing and defensive technologies.

Despite significant progress in defenses, the fundamental challenge remains: LLMs are general-purpose systems trained on vast amounts of data, and their safety constraints are added after the fact through alignment techniques that do not change the underlying capabilities. As long as this architectural reality persists, jailbreaking will likely remain possible, and the focus of the field will continue to be on raising the cost and difficulty of attacks rather than eliminating them entirely.

Multimodal jailbreaks

As AI models have expanded to process images, audio, video, and other modalities alongside text, entirely new categories of jailbreak attacks have emerged. These multimodal attacks exploit the fact that safety training is often inconsistent across modalities, with text-based guardrails being more robust than those for visual or audio inputs.

Image-based attacks

Image-based prompt injection hides adversarial instructions inside images that are processed by multimodal models. The attack exploits the tendency of vision encoders like CLIP to prioritize textual information embedded within images over the visual content itself. Researchers have demonstrated that carefully crafted images containing adversarial text overlays, hidden in low-opacity layers or disguised within visual patterns, can override safety instructions provided in the text prompt ^[9].

Testing has revealed asymmetric vulnerabilities across models. GPT-4o was successfully injected through image-based attacks, while Claude 3 showed only partial susceptibility. As multimodal models gain stronger OCR capabilities, they become increasingly susceptible to text-in-image injection attacks ^[9].

Audio-based attacks

Audio deepfake technology has been combined with jailbreaking to create compound attacks against voice-enabled AI systems. By synthesizing audio that contains adversarial content masked as natural speech, attackers can trigger harmful responses from voice-interactive AI agents. This vector is particularly concerning for AI customer service systems and voice assistants that process audio inputs directly.

The most sophisticated multimodal jailbreaks use cross-modal techniques, where adversarial content in one modality influences the model's behavior in another. For example, an adversarial image might prime the model to be more compliant with harmful text instructions that follow, even when the text instructions alone would be refused. The JPRO attack framework, introduced in 2025, exemplifies this paradigm: it uses four specialized agents (Planner, Attacker, Modifier, and Verifier) to coordinate attacks across text and image modalities against vision-language models ^[18].

A 2026 survey of jailbreak attacks noted that "contemporary threat landscapes extend beyond text generation to multimodal systems processing images, audio, and video, where attackers exploit modality gaps and inconsistent safety alignment" ^[18].

Agentic AI jailbreaks

The emergence of AI agents capable of taking real-world actions has opened a fundamentally new threat surface for jailbreak attacks. Unlike a jailbroken chatbot that merely produces harmful text, a jailbroken AI agent with access to email, databases, file systems, and financial tools can cause direct, tangible harm.

Agent-specific vulnerabilities

AI agents interact with the world through tools and protocols. The rapid adoption of the Model Context Protocol (MCP) for connecting language models to external tools has dramatically expanded the attack surface. Researchers have identified several agent-specific vulnerability categories:

Vulnerability	Description	Example
Tool poisoning	Malicious tool descriptions that inject adversarial instructions	An MCP server with a tool description containing hidden jailbreak prompts
Indirect prompt injection (IPI)	Adversarial instructions embedded in data sources the agent reads	A webpage containing invisible text that instructs the agent to exfiltrate data
Privilege escalation	Agent gains access to tools or data beyond its intended scope	Jailbreak that causes agent to use administrative APIs it should not access
Supply chain attacks	Compromised tools or plugins in the agent's ecosystem	Malicious MCP servers published to package registries
Memory poisoning	Adversarial content injected into the agent's persistent memory	Instructions embedded in documents that persist across conversations

In one documented case in 2025, a malicious GitHub issue containing hidden instructions was able to hijack an AI agent connected via MCP and trigger data exfiltration from private repositories. A critical vulnerability (CVE-2025-32711, known as "EchoLeak") demonstrated that engineered prompts in email messages could trigger Microsoft Copilot to automatically exfiltrate sensitive data without any user interaction. Disclosed in June 2025 with a CVSS score of 9.3, EchoLeak was a zero-click attack: an attacker could send a crafted email, and when the recipient asked Copilot to summarize their inbox, the assistant would silently leak sensitive documents to an external server ^[19].

AI-as-attacker

A particularly concerning development in 2025 and 2026 has been the emergence of AI models capable of independently planning and executing multi-turn jailbreak strategies against other AI models. Models like DeepSeek-R1 and Gemini 2.5 Flash have demonstrated the ability to decompose harmful queries across conversation turns, achieving 95% success rates through agent-driven multi-turn attacks ^[18].

This "AI-as-attacker" paradigm changes the threat model fundamentally. Human attackers are limited by time, creativity, and effort. AI attackers can generate and test thousands of jailbreak variants per hour, rapidly identifying and exploiting weaknesses in defenses. The scalability of automated attacks means that even marginally effective techniques can be applied at a volume that overwhelms manual monitoring and response capabilities.

Embodied AI risks

Beyond digital agents, researchers have begun examining jailbreak risks for embodied AI systems such as robots and autonomous vehicles. Jailbreaks that trigger harmful physical actions represent a qualitative escalation in risk beyond digital domains. While no real-world incidents of physical harm from jailbroken embodied AI have been publicly documented, researchers have demonstrated proof-of-concept attacks against robotic platforms in laboratory settings ^[18].

Advanced defense mechanisms (2025-2026)

Constitutional Classifiers

Anthropic's Constitutional Classifiers represent one of the most significant defensive advances in the jailbreak domain. These are separate safeguard models that monitor both inputs and outputs to detect and block potentially harmful content. They are trained on synthetic data generated from a natural-language "constitution" that specifies what content is allowed and what is not ^[14].

In testing, Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% on a version of Claude 3.5 Sonnet (October 2024 model), blocking 95% of attacks that would otherwise bypass the model's built-in safety. The evaluation used 10,000 synthetically generated jailbreaking prompts ^[14].

Anthropic also ran a public bug bounty from February 3-10, 2025, with rewards up to $15,000 for finding universal jailbreaks. Over 183 participants spent an estimated 3,000+ hours across more than 300,000 chat interactions attempting to defeat the system. No participant discovered a universal jailbreak ^[14].

A next-generation version, Constitutional Classifiers++, further improved robustness while reducing false refusal rates and adding only approximately 1% additional compute cost ^[15].

Circuit breakers

Circuit breakers, developed by researchers at Gray Swan AI and Carnegie Mellon University, operate directly on a model's internal representations rather than filtering inputs or outputs. The technique activates when the model's internal state enters harmful subspaces, redirecting the computation before harmful content can be generated. On Mistral-7B-Instruct-v2, circuit breaking reduced harmful output rates from 76.7% to 9.8%; on Llama-3-8B-Instruct, from 38.1% to 3.8% ^[20].

Circuit breakers are attack-agnostic, meaning they defend based on what the model is doing internally rather than matching specific attack patterns. However, multi-turn attacks like Crescendo have exposed a significant generalization gap: single-turn defenses like circuit breakers are less effective against extended conversational attacks that gradually shift the model's internal state toward harmful territory ^[20].

Unified defense frameworks

Research in 2025 and 2026 has moved toward unified defense frameworks that layer multiple protection mechanisms. A comprehensive defense architecture typically operates at three levels:

Layer	Mechanism	Function
Perception layer	Variant-consistency and gradient-sensitivity detection	Identifies adversarial inputs before they reach the model
Generation layer	Safety-aware decoding and output review	Monitors and filters the model's generation process
Parameter layer	Adversarially augmented preference alignment	Embeds safety more deeply in the model's learned representations

This layered approach mirrors the "defense-in-depth" strategy recommended by the 2026 International AI Safety Report for AI risk management more broadly. No single defense mechanism has proven sufficient against the full range of known attack techniques, but the combination of multiple complementary defenses significantly raises the cost and difficulty of successful attacks.

References

OWASP. "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/ ↩
Seabout. "DAN Jailbreak Prompt." Reddit, December 15, 2022. ↩
CNBC. "ChatGPT's 'jailbreak' tries to make the A.I. break its own rules, or die." February 6, 2023. https://www.cnbc.com/2023/02/06/chatgpt-jailbreak-forces-it-to-break-its-own-rules.html ↩
Shen, X. et al. "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." ACM CCS 2024. https://dl.acm.org/doi/10.1145/3658644.3670388 ↩
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., & Fredrikson, M. "Universal and Transferable Adversarial Attacks on Aligned Language Models." 2023. https://arxiv.org/abs/2307.15043 ↩
Anthropic. "Many-shot jailbreaking." April 2024. https://www.anthropic.com/research/many-shot-jailbreaking ↩
Microsoft Security Blog. "Mitigating Skeleton Key, a new type of generative AI jailbreak technique." June 26, 2024. https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/ ↩
Zhu, S. et al. "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models." 2023. https://arxiv.org/html/2310.15140v2 ↩
"Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs." 2025. https://arxiv.org/html/2509.05883v1 ↩
"Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models." 2025. https://arxiv.org/html/2511.15304v1 ↩
VentureBeat. "Red teaming LLMs exposes a harsh truth about the AI security arms race." 2025. https://venturebeat.com/security/red-teaming-llms-harsh-truth-ai-security-arms-race/ ↩
Wei, A. et al. "Jailbroken: How Does LLM Safety Training Fail?" 2023. https://ar5iv.labs.arxiv.org/html/2307.02483 ↩
Bai, Y. et al. "Constitutional AI: Harmlessness from AI Feedback." Anthropic, 2022. https://arxiv.org/abs/2212.08073 ↩
Anthropic. "Constitutional Classifiers: Defending against universal jailbreaks." 2025. https://www.anthropic.com/research/constitutional-classifiers ↩
Anthropic. "Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks." 2025-2026. https://arxiv.org/abs/2601.04603 ↩
"Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks." OpenReview. https://openreview.net/forum?id=wetJo6xXb1 ↩
Mindgard. "AI Red Teaming Statistics & Benchmarks for 2026." https://mindgard.ai/blog/ai-red-teaming-statistics ↩
"Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defenses." arXiv, 2026. https://arxiv.org/html/2601.03594v1 ↩
Palo Alto Networks Unit 42. "New Prompt Injection Attack Vectors Through MCP Sampling." 2025. https://unit42.paloaltonetworks.com/model-context-protocol-attack-vectors/ ↩
Gray Swan AI. "Improving Alignment and Robustness with Circuit Breakers." 2024. https://www.grayswan.ai/research/circuit-breakers ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

5 revisions by 1 contributors · full history

Suggest edit

History

When did AI jailbreaking begin?

Common Techniques

DAN and Persona-Based Jailbreaks

Encoding and Obfuscation

Automated and Gradient-Based Attacks

Many-Shot Jailbreaking

Skeleton Key and Crescendo

Image-Based and Multimodal Attacks

Why do jailbreaks work?

Competing Training Objectives

Mismatched Generalization

In-Context Learning Exploitability

RLHF Limitations

What are the risks of jailbreaks?

Generating Harmful Content

Extracting System Prompts

Bypassing Content Filters

Attack Success Rates

What defenses exist against jailbreaks?

Constitutional AI

Constitutional Classifiers

Input and Output Filtering

System Prompt Hardening

Adversarial Training

How does jailbreaking relate to prompt injection and red teaming?

Prompt Injection

Red Teaming

Academic Research

Foundational Work

Key Research Contributions

"Do Anything Now" Study

Responsible Disclosure

Current State (2025-2026)

Multimodal jailbreaks

Image-based attacks

Audio-based attacks

Cross-modal attacks

Agentic AI jailbreaks

Agent-specific vulnerabilities

AI-as-attacker

Embodied AI risks

Advanced defense mechanisms (2025-2026)

Constitutional Classifiers

Circuit breakers

Unified defense frameworks

See Also

References

Improve this article

Related Articles

Anthropic

Frontier models

Grok 3 Jailbreak

Grounding (artificial intelligence)

AI Parasite

Artificial General Intelligence

What links here (24 of 31)

Related Articles

Anthropic

Frontier models

Grok 3 Jailbreak

Grounding (artificial intelligence)

AI Parasite

Artificial General Intelligence

What links here (24 of 31)