Many-shot jailbreaking
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,161 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,161 words
Add missing citations, update stale details, or suggest a clearer explanation.
Many-shot jailbreaking is a technique for bypassing the safety training of a large language model by filling its context window with a long series of faux dialogue turns in which an AI assistant complies with harmful requests, then appending a final harmful query. The model, having been conditioned by the preceding examples, becomes more likely to answer the final query in kind. Anthropic described and named the technique in research published on April 2, 2024, and reported that its effectiveness grows with the number of in-context examples according to a power law. [1][2]
The attack belongs to a family of long-context prompt injections that became practical only after model providers expanded context windows from a few thousand tokens to hundreds of thousands or more. A window large enough to hold several long novels can also hold hundreds of fabricated question-and-answer exchanges, and that volume of demonstrations is what makes the method work. [1][6] Anthropic framed the finding as a property of in-context learning rather than a defect in any single model: the same prompt structure that lets a model pick up a benign task from examples can be turned toward eliciting prohibited content. [1]
The work was led by Cem Anil, with co-authors including Esin Durmus, Mrinank Sharma, Joe Benton, Jesse Mu, Evan Hubinger, and others. [3] It was later presented at NeurIPS 2024. [3][4]
A single prompt is constructed as a long transcript of invented dialogue. Each turn shows a user posing a harmful question and an AI assistant answering it directly, with no refusal. After many such turns, the attacker adds the real target question. The model treats the preceding turns as context that establishes how the assistant in this conversation behaves, and it tends to continue the established pattern. [1][2]
This relies on in-context learning, the ability of a model to infer a task or behavior from examples placed in the prompt without any change to its weights. [5] The harmful demonstrations function as a behavioral template. Because the examples all sit inside one prompt, the technique does not require fine-tuning, gradient access, or any special API; it works against ordinary chat interfaces, which is part of why Anthropic treated it as broadly applicable. The examples span several categories of prohibited content, including violent or hateful statements, deception, discrimination, and regulated topics, and the attack was reported to generalize across these categories rather than being tied to one. [6] The description here is deliberately high level and omits operational detail.
The central empirical claim is that attack success follows a power law in the number of shots. With only a handful of demonstrations the attack typically fails; effectiveness rises as more are added and continues climbing into the hundreds. [1][2] In Anthropic's reporting, attacks that did not work at all at 5 shots worked consistently at 256 shots against Claude 2.0 across several task types. [6]
Anthropic connected this to a more general regularity: benign in-context learning tasks also improve with more examples along similar power-law curves. The implication is that many-shot jailbreaking exploits the same learning dynamics that make in-context learning useful, which makes it difficult to remove without affecting legitimate capability. [1] The paper's authors noted that very long contexts therefore represent a new attack surface for language models. [3]
The technique was reported to work on the most widely used state-of-the-art models, including those from Anthropic and from competing developers. [1][3] Specific systems named in coverage of the research include Claude 2.0, GPT-3.5 and GPT-4, Llama 2 (70B), and Mistral 7B. [6] The paper's abstract frames the larger context windows deployed by Google DeepMind, OpenAI, and Anthropic as what made the attack newly feasible. [3] Anthropic also observed that the attack was often more effective against larger models, which it described as a concerning property because larger models are generally more capable. [1]
Anthropic tested several defenses and reported mixed results. Fine-tuning a model to refuse the kinds of queries used in the attack raised the number of shots needed to succeed but did not change the underlying scaling: with enough demonstrations the jailbreak still went through. In other words, fine-tuning delayed the attack rather than stopping it. [1][6]
More effective were methods that classify and modify the prompt before it reaches the model. In one case this approach reduced the attack success rate from 61% to 2% on the test set Anthropic used. [1][6] The same blog reported that prompt-based classification and modification substantially cut effectiveness across the cases studied. Anthropic cautioned that such defenses are an active area of work and that input-level mitigations can shift the problem rather than eliminate it. [1] The company's later work on input and output classifiers, including constitutional classifiers, continued this line of defense against jailbreaks that exploit long or adversarial prompts.
| Mitigation approach | Reported effect |
|---|---|
| Fine-tuning to refuse attack-style queries | Increased the number of shots required; same power-law scaling; attack still eventually succeeded |
| Prompt-based classification and modification | Attack success rate reduced from 61% to 2% on the tested set |
Limiting the context window length would also blunt the attack, since fewer shots can be supplied, but that would remove the very capability that long-context models are built to provide, so it is not a practical fix on its own. [1]
Before publishing, Anthropic confidentially shared the details of many-shot jailbreaking with researchers in academia and at competing AI companies, and it said it had already put some defensive measures in place for its own systems. [1] The stated reason for publishing openly was to accelerate work on mitigations: the researchers argued that drawing attention to the vulnerability would speed up fixes more than keeping it private. [5]
The result drew attention because it is simple to understand and hard to fully patch. It does not depend on adversarial token strings or obfuscation; it uses plain text and the model's own willingness to learn from context. It also illustrated a tradeoff that recurs in safety research, where a feature that increases usefulness, here a very long context window, simultaneously enlarges the space of possible attacks. The paper sits alongside other Anthropic safety research, including work on sleeper agents and alignment faking, that probes how training and inference-time conditioning can produce unwanted behavior. Subsequent academic papers built on the finding, both extending the attack and proposing further defenses.