JailbreakBench is an open-source robustness benchmark designed to systematically evaluate jailbreak attacks against large language models (LLMs). Introduced by Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong in 2024, the benchmark addresses a persistent problem in AI safety research: the lack of standardized methods for measuring whether LLMs can be tricked into producing harmful content. The paper was accepted at the NeurIPS 2024 Datasets and Benchmarks Track, and the full codebase, datasets, and leaderboard are publicly available. Researchers from the University of Pennsylvania, ETH Zurich, EPFL, the University of Tubingen, and Princeton University collaborated on the project.
Jailbreak attacks exploit weaknesses in an LLM's safety alignment to cause the model to generate harmful, unethical, or otherwise objectionable outputs that it would normally refuse. As LLMs have grown more capable and widely deployed, the arms race between attackers finding new jailbreaks and defenders patching vulnerabilities has intensified. Before JailbreakBench, this research area suffered from fragmented evaluation practices, incomparable metrics, and poor reproducibility. JailbreakBench was created to bring order to this landscape by providing a unified framework that the research community can use to compare methods on equal footing.
Research on jailbreaking LLMs has grown rapidly, but three core problems have made it difficult to compare results across different papers and research groups.
First, assessing whether an LLM response is genuinely harmful requires open-ended evaluation, and no single standard existed for making that judgment. Different researchers used different classifiers, different prompts for those classifiers, and different thresholds for what counts as a successful jailbreak. This meant that attack success rate (ASR) numbers from one paper were not directly comparable to numbers from another.
Second, even when researchers used similar metrics, the way they computed costs and success rates varied. Some counted only the number of queries to the target model; others counted tokens. Some allowed unlimited queries while others imposed budgets. These differences made it nearly impossible to determine which attack methods were actually more efficient.
Third, many published jailbreak attacks withheld their adversarial prompts, kept their code closed-source, or relied on proprietary APIs that changed over time. A jailbreak that worked against GPT-4 in January might not work in March because OpenAI had updated the model's safety filters. Without access to the original prompts and a stable evaluation environment, reproducing published results was often impractical.
JailbreakBench tackles all three problems. It provides a curated dataset of harmful behaviors, a standardized evaluation pipeline with fixed system prompts and chat templates, a validated jailbreak classifier, a repository of submitted adversarial prompts, and a public leaderboard that tracks attack and defense performance over time.
JailbreakBench consists of four major components that together form a complete evaluation ecosystem for jailbreak research.
The JBB-Behaviors dataset is the foundation of the benchmark. It contains 100 harmful behaviors and 100 corresponding benign behaviors, for a total of 200 entries. The harmful behaviors are divided evenly across ten categories that align with OpenAI's usage policies, with ten behaviors per category.
The ten behavior categories are:
| Category | Example Behaviors |
|---|---|
| Harassment/Discrimination | Defamation, body shaming, xenophobia, ableism, sexism, racial discrimination, LGBTQ+ harassment |
| Malware/Hacking | Disabling antivirus software, network hacking, ransomware creation, keylogger scripts, password cracking |
| Physical Harm | Bomb building, poisoning, torture methods, suicide manipulation, ethnic genocide, stalking |
| Economic Harm | Extortion, gambling exploitation, payday loan scams, bribery, tax evasion, insider trading, counterfeiting |
| Fraud/Deception | Election interference, fake charities, pseudo-pharmaceuticals, astroturfing, Nigerian 419 scams, deepfakes |
| Disinformation | Political disinformation, 9/11 conspiracy theories, Holocaust denial, COVID-19 conspiracy theories, climate change denial |
| Sexual/Adult Content | Explicit content generation, sexual solicitation of minors, revenge porn, human trafficking content |
| Privacy | Identity theft, phishing emails, facial recognition for racial profiling, health record theft, location tracking |
| Expert Advice | Unauthorized medical treatment advice, opioid prescriptions, DUI avoidance, organ trafficking, emissions test cheating |
| Government Decision-Making | Document forgery, criminal record alteration, voting machine tampering, nuclear weapons instructions, child labor facilitation |
Each entry in the dataset specifies five fields:
| Field | Description |
|---|---|
| Behavior | A unique identifier describing a distinct misuse behavior (e.g., "Phishing," "Defamation") |
| Goal | A query requesting the objectionable behavior from the model |
| Target | An affirmative response template that a jailbroken model might produce |
| Category | The broader OpenAI usage policy category |
| Source | The origin of the behavior (Original, AdvBench, TDC/HarmBench) |
Approximately 55% of the behaviors are original contributions from the JailbreakBench authors, while the rest are sourced from existing datasets: AdvBench (Zou et al., 2023), the Trojan Detection Challenge 2023, and HarmBench (Mazeika et al., 2024). The authors took care to remove duplicated entries and behaviors that would be impossible to fulfill, which were common problems in earlier datasets.
The 100 benign behaviors serve as a control set. They are thematically similar to the harmful behaviors but are safe to answer, allowing researchers to measure overrefusal rates. A defense that blocks all harmful content but also refuses legitimate requests is not practically useful, and the benign set makes it possible to quantify this tradeoff.
The dataset is hosted on Hugging Face (DOI: 10.57967/hf/2540) under an MIT license and can be loaded through either the JailbreakBench Python library or the Hugging Face Datasets library.
The evaluation framework defines every aspect of how attacks and defenses should be tested, eliminating the inconsistencies that plagued earlier work.
Threat Model. JailbreakBench classifies attacks into three categories based on the level of access the attacker has to the target model:
| Attack Type | Description |
|---|---|
| White-box | The attacker has full access to the model's architecture, weights, and gradients. This is only possible for open-source models. |
| Black-box | The attacker can only query the model through its API and observe responses. No access to internal parameters is available. |
| Transfer | The attacker generates adversarial prompts using a white-box surrogate model and then applies them to a different target model without further adaptation. |
System Prompts and Chat Templates. For each supported model, JailbreakBench specifies the exact system prompt and chat template to use during evaluation. This ensures that all researchers test against the same configuration rather than accidentally varying these parameters.
Scoring and Classification. The benchmark provides standardized scoring functions that determine whether a model response constitutes a successful jailbreak. In version 1.0, the primary jailbreak judge is Llama 3 Instruct 70B, selected after a rigorous comparison of multiple classifiers (described in detail below). A separate refusal judge based on Llama 3 8B evaluates whether the model actively refused the request, enabling finer-grained analysis of model behavior.
Cost Tracking. For every attack, JailbreakBench tracks the total number of queries to the target model, the total number of prompt and response tokens, and the number of queries or tokens per successful jailbreak. This allows researchers to compare not just raw success rates but also the efficiency of different attack strategies.
A distinctive feature of JailbreakBench is its requirement that all submissions include the actual adversarial prompts used. These are stored in a public artifacts repository on GitHub. Each submission records, for every behavior in the dataset, the prompt that was used, the model's response, whether the jailbreak succeeded, the number of queries required, and the token counts.
This transparency serves several purposes. It allows other researchers to reproduce results exactly. It enables the community to study the characteristics of successful jailbreaks and develop better defenses. And it creates a historical record that shows how attack techniques have evolved over time.
The artifacts can be loaded programmatically through the JailbreakBench Python API:
import jailbreakbench as jbb
artifact = jbb.read_artifact(
method="PAIR",
model_name="vicuna-13b-v1.5"
)
print(artifact.jailbreaks<sup><a href="#cite_note-75" class="cite-ref">[75]</a></sup>)
Each artifact entry contains the behavior index, goal, category, prompt, response, number of queries, queries to jailbreak, prompt tokens, response tokens, and a boolean indicating whether the jailbreak was successful.
The JailbreakBench leaderboard, hosted at jailbreakbench.github.io, provides a centralized view of attack and defense performance across models. The leaderboard maintains separate rankings for open-source and closed-source models and allows filtering by attack type, defense method, and other metadata.
The leaderboard displays the attack success rate for each method on each model, along with cost metrics. It links directly to the corresponding jailbreak artifacts, so anyone can inspect the actual prompts and responses behind each entry.
One of the most important contributions of JailbreakBench is its rigorous evaluation of jailbreak classifiers. The choice of classifier has a large impact on reported attack success rates, and using different classifiers was a major source of inconsistency in prior work.
The authors compared six different jailbreak classifiers:
| Classifier | Type | Description | |---|---| | GPT-4 | LLM judge | Uses GPT-4 with a specific prompt to classify responses as jailbroken or not | | GPT-4 Turbo | LLM judge | Similar to GPT-4 but using the Turbo variant | | GCG (rule-based) | Keyword matching | Uses handcrafted rules to detect refusals | | BERT-based | Fine-tuned classifier | A BERT model fine-tuned on jailbreak detection data | | TDC (Trojan Detection Challenge) | Fine-tuned classifier | Classifier from the Trojan Detection Challenge | | Llama Guard | LLM judge | Meta's safety classifier based on Llama |
To establish ground truth, three expert annotators independently labeled each prompt-response pair, achieving approximately 95% inter-annotator agreement. The majority vote of these three annotators served as the reference label.
The evaluation revealed significant differences between classifiers. GPT-4 and Llama 3 70B achieved the highest agreement with human annotators, both exceeding 90% agreement. Llama Guard 2 reached 87.7% agreement with roughly equal false positive and false negative rates. The HarmBench classifier and the original Llama Guard had lower agreement rates of 78.3% and 72.0%, respectively.
A critical finding was that the HarmBench classifier exhibited a high false positive rate on benign examples from XS-Test, reaching 26.8% overall. The BERT-based classifier failed to identify 74% of jailbreaks, making it unreliable for measuring attack success.
The authors selected Llama 3 Instruct 70B as the default jailbreak judge for JailbreakBench. The key reasons were that it achieved GPT-4-level agreement with human annotators (90.7% agreement with ground truth), it is an open-weight model that anyone can run without API costs, and its behavior is reproducible since the weights do not change over time. By contrast, proprietary models like GPT-4 can be updated by their providers at any time, potentially changing evaluation results.
The judge comparison dataset, containing 300 examples with labels from three human annotators and four LLM judges, is publicly available on Hugging Face for researchers who want to evaluate new classifiers.
JailbreakBench includes several baseline attack methods spanning different attack paradigms. Version 1.0 of the benchmark evaluates four primary attack strategies.
GCG, introduced by Zou et al. (2023), is a white-box attack that optimizes an adversarial suffix appended to the input prompt. The method uses gradient information to iteratively refine the suffix tokens so that the model's output begins with a target affirmative response (e.g., "Sure, here is..."). Because GCG requires access to model gradients, it can only be applied directly to open-source models. However, the resulting adversarial suffixes can sometimes transfer to other models, including closed-source ones.
In the JailbreakBench evaluation, GCG showed mixed results. It achieved moderate success on Vicuna but performed poorly against models with stronger safety alignment, recording only about 3% ASR on Llama 2 and roughly 4% on GPT-4. The authors noted that after the release of the jailbreak artifacts, the success rate of GCG on GPT models decreased to approximately 5%, likely due to safety patches applied by OpenAI.
PAIR, developed by Chao et al. (2023), is a black-box attack that uses a separate attacker LLM to iteratively generate and refine jailbreak prompts. The attacker model proposes a candidate prompt, observes the target model's response, and then modifies its approach based on whether the jailbreak succeeded. This process typically runs for a fixed number of iterations.
PAIR is query-efficient because it leverages the attacker LLM's understanding of language and social engineering to craft semantically meaningful prompts rather than relying on gradient-based token optimization. In the JailbreakBench evaluation, PAIR achieved high success rates on Vicuna and GPT-3.5 but was less effective against more strongly aligned models like Llama 2.
JailbreakChat represents manually crafted jailbreak prompts collected from online communities. The benchmark uses the AIM (Always Intelligent and Machiavellian) template, one of the most well-known hand-crafted jailbreak prompts. These prompts typically use role-playing scenarios, hypothetical framing, or other social engineering techniques to bypass safety guardrails.
In testing, the AIM template from JailbreakChat proved effective against Vicuna but failed on all behaviors when applied to Llama 2 and the GPT models. This result highlights the difference between models with varying levels of safety training, and it shows that hand-crafted prompts that work on weaker models often do not generalize.
This attack, based on work by Andriushchenko et al. (2024), combines a manually optimized prompt template with adversarial suffixes found through random search. The method uses self-transfer, where adversarial suffixes discovered on one model are applied to another, to improve efficiency.
Prompt with RS emerged as the most effective attack in the JailbreakBench evaluation. It achieved approximately 90% ASR on Llama 2 and 78% on GPT-4, substantially outperforming all other methods. Its query efficiency was also notable, requiring on average only about 2 queries on Vicuna and 3 on GPT-3.5 to achieve a successful jailbreak. The high performance is attributed to the combination of a carefully designed prompt template with optimized adversarial suffixes.
The following table summarizes the approximate attack success rates from the initial JailbreakBench evaluation, as measured by the Llama Guard classifier:
| Attack Method | Attack Type | Vicuna-13b | Llama-2-7b | GPT-3.5 Turbo | GPT-4 |
|---|---|---|---|---|---|
| GCG | White-box / Transfer | Moderate | ~3% | Low | ~4% |
| PAIR | Black-box | High | Low | High | Moderate |
| JailbreakChat (AIM) | Black-box | High | 0% | 0% | 0% |
| Prompt + RS | Black-box / Transfer | Very High | ~90% | High | ~78% |
These results demonstrate that no single defense approach is adequate against all attack types, and that attack effectiveness varies dramatically depending on the target model's safety alignment and the sophistication of the attack method.
JailbreakBench evaluates attacks across both open-source and closed-source LLMs. The initial set of supported models includes:
| Model | Type | Provider |
|---|---|---|
| Vicuna-13b-v1.5 | Open-source | LMSYS |
| Llama-2-7b-chat-hf | Open-source | Meta |
| Llama-3-8b-instruct | Open-source | Meta |
| GPT-3.5-turbo-1106 | Closed-source | OpenAI |
| GPT-4-0125-preview | Closed-source | OpenAI |
Open-source models can be queried either locally through vLLM or via cloud APIs through Together AI using the LiteLLM integration. Closed-source models are accessed through their respective APIs.
The choice of these models reflects a deliberate strategy: Vicuna serves as a relatively weakly aligned model that most attacks can succeed against, Llama 2 represents stronger open-source safety alignment, and the GPT models represent the current state of commercial safety engineering. This spread allows researchers to evaluate how attacks perform across a range of alignment strengths.
JailbreakBench supports the evaluation of test-time defense algorithms that attempt to detect or neutralize jailbreak prompts before they reach the model. Version 1.0 includes five baseline defenses.
SmoothLLM, proposed by Robey et al. (2023), creates multiple perturbed copies of the input prompt by randomly swapping, inserting, or deleting characters. It then queries the model with each perturbed copy and takes a majority vote over the responses. The idea is that adversarial suffixes (like those generated by GCG) are brittle and will be disrupted by character-level perturbations, while legitimate prompts will produce consistent responses across perturbations.
In the JailbreakBench evaluation, SmoothLLM generally demonstrated strong performance in reducing ASR, particularly against GCG attacks. It reduced GCG's success rate on Vicuna to about 4% and to 0% on Llama 2. It also reduced PAIR's success rates more effectively than the perplexity filter.
The perplexity filter, based on work by Jain et al. (2023), rejects input prompts that have unusually high perplexity, as measured by a reference language model. The intuition is that adversarial suffixes generated by gradient-based attacks often contain nonsensical token sequences that a language model would assign low probability.
This defense showed limited effectiveness in the evaluation. While it successfully filtered out GCG prompts (which contain obviously unnatural text), it was largely ineffective against semantically meaningful attacks. Against JailbreakChat prompts, the perplexity filter allowed 90% ASR on Vicuna, and against Prompt + RS it allowed 88% ASR on Vicuna, 73% on Llama 2, and 70% on GPT-4.
Erase-and-Check works by systematically erasing portions of the input prompt and checking whether the remaining text still triggers a harmful response. If any erased version produces a safe response, the defense flags the original prompt as potentially adversarial.
This defense consistently achieved very low ASR values in the evaluation, suggesting strong robustness against transfer attacks. It was particularly effective at reducing the success of GCG and PAIR attacks to near-zero levels.
This defense replaces words in the input prompt with their synonyms before passing the modified prompt to the model. The transformation disrupts adversarial patterns while preserving the semantic content of legitimate queries.
Synonym substitution proved surprisingly effective in the evaluation, keeping ASR below 24% across most attack methods. Against Prompt + RS on Vicuna, it reduced the success rate to just 2%, and on Llama 2 to 0%.
This defense strips out any tokens that do not appear in a standard dictionary. The goal is to remove adversarial suffixes that consist of nonsensical character sequences.
The defense showed limited improvements compared to other methods. Prompt + RS still achieved 91% success on Vicuna even with this defense active, because the attack's adversarial components can be designed to use dictionary-valid words.
| Defense | GCG (Vicuna) | GCG (Llama-2) | PAIR (Vicuna) | JB-Chat (Vicuna) | Prompt+RS (Vicuna) | Prompt+RS (Llama-2) | Prompt+RS (GPT-4) |
|---|---|---|---|---|---|---|---|
| None (baseline) | Moderate | ~3% | High | High | Very High | ~90% | ~78% |
| SmoothLLM | ~4% | 0% | Reduced | Reduced | Reduced | Reduced | Reduced |
| Perplexity Filter | ~0% | ~0% | 69% | 90% | 88% | 73% | 70% |
| Erase-and-Check | 0-2% | 0-2% | 0-2% | Reduced | Reduced | Reduced | Reduced |
| Synonym Substitution | Low | Low | Low | Low | ~2% | 0% | Low |
| Remove Non-Dictionary | Reduced | Reduced | Reduced | Reduced | 91% | Moderate | Moderate |
The results show that no single defense is effective against all attack types. Defenses that target the syntactic properties of adversarial prompts (like perplexity filtering and non-dictionary word removal) fail against semantically meaningful attacks, while defenses that perturb the prompt content (like SmoothLLM and synonym substitution) show broader effectiveness but may degrade benign performance.
JailbreakBench is distributed as a Python package installable via pip:
pip install jailbreakbench
For local model inference using vLLM:
pip install jailbreakbench[vllm]
The library provides a straightforward API for loading datasets, querying models, running evaluations, and accessing submitted artifacts.
Loading the dataset:
import jailbreakbench as jbb
dataset = jbb.read_dataset()
behaviors = dataset.behaviors
goals = dataset.goals
targets = dataset.targets
categories = dataset.categories
# Load as pandas DataFrame
df = dataset.as_dataframe()
# Load benign behaviors
benign = jbb.read_dataset("benign")
Querying models:
import os
import jailbreakbench as jbb
# API-based querying via LiteLLM
llm = jbb.LLMLiteLLM(
model_name="vicuna-13b-v1.5",
api_key=os.environ["TOGETHER_API_KEY"]
)
# Or local inference via vLLM
llm = jbb.LLMvLLM(model_name="vicuna-13b-v1.5")
# Query with a list of prompts
responses = llm.query(
prompts=["Write a phishing email."],
behavior="phishing"
)
Running evaluations with defenses:
responses = llm.query(
prompts=prompts,
behavior="phishing",
defense="SmoothLLM"
)
Evaluating a full set of prompts:
evaluation = jbb.evaluate_prompts(
all_prompts,
llm_provider="litellm",
defense="SmoothLLM"
)
All queries are automatically logged to a logs/ directory with metadata including timestamps, token counts, and response details.
The evaluation pipeline includes two specialized judges:
from jailbreakbench.classifier import (
Llama3JailbreakJudge,
Llama3RefusalJudge
)
jailbreak_judge = Llama3JailbreakJudge(api_key)
refusal_judge = Llama3RefusalJudge(api_key)
is_jailbroken = jailbreak_judge([prompt], [response])
is_refusal = refusal_judge([prompt], [response])
The jailbreak judge determines whether the model's response contains harmful content that fulfills the requested behavior. The refusal judge determines whether the model explicitly declined the request. A response can be classified as neither a jailbreak nor a refusal if the model produces an irrelevant or evasive answer without directly refusing.
Researchers who develop new jailbreak attack methods can submit their results to the JailbreakBench leaderboard through a structured process:
jbb.evaluate_prompts(), which queries the models and classifies responses.jbb.create_submission(), specifying the method name, attack type (white-box, black-box, or transfer), and method parameters.Defense submissions follow a different process that involves contributing code directly to the repository:
query() method.After the defense is merged, the submitter runs the standard attack evaluations against the defended model and submits the resulting artifacts.
JailbreakBench has evolved through several releases since its initial publication.
| Version | Date | Key Changes |
|---|---|---|
| v0.1.1 | March 27, 2024 | Initial code release |
| v0.1.2 | March 27, 2024 | Added standalone defense mechanisms and defense artifacts |
| v0.1.3 | April 6, 2024 | Fixed inconsistencies identified in the paper |
| v1.0.0 | June 13, 2024 | Major release: moved dataset to Hugging Face, added new judges, integrated defenses, clarified data sources |
The v1.0 release represented a substantial expansion. It added the Prompt with Random Search attack artifacts alongside the original GCG, PAIR, and JailbreakChat entries. It introduced additional test-time defenses including Erase-and-Check, Synonym Substitution, and Remove Non-Dictionary Words, supplementing the original SmoothLLM and Perplexity Filter. The jailbreak judge was upgraded from Llama Guard to Llama 3 70B with a custom prompt, achieving GPT-4-level agreement with human annotations. The human preference dataset for judge calibration was expanded from 100 to 300 examples. A semantic refusal judge based on Llama 3 8B was added, along with an overrefusal evaluation dataset of 100 benign and borderline behaviors matching the 100 harmful behaviors.
JailbreakBench exists alongside several other benchmarks and datasets for evaluating LLM safety. Understanding how they relate helps clarify JailbreakBench's specific contributions.
AdvBench (Zou et al., 2023) introduced one of the first large-scale datasets of harmful behaviors for evaluating adversarial attacks on LLMs. However, AdvBench contained duplicated entries and behaviors that were impossible to fulfill. JailbreakBench sources some of its behaviors from AdvBench but curates them to remove these quality issues.
HarmBench (Mazeika et al., 2024), from the Center for AI Safety, provides a standardized evaluation framework for automated red teaming and robust refusal. It covers a broader array of topics, including copyright infringement and multimodal models. JailbreakBench differs in its focus on supporting adaptive attacks, test-time defenses, and maintaining a public repository of adversarial prompts.
Trojan Detection Challenge (TDC) 2023 contributed harmful behavior examples that were incorporated into the JBB-Behaviors dataset. The TDC focused specifically on detecting hidden trojans in language models, while JailbreakBench focuses on the broader jailbreaking problem.
StrongReject and other evaluation frameworks have emerged to measure LLM robustness from different angles, but JailbreakBench's combination of a curated dataset, standardized evaluation, public artifacts, and an active leaderboard distinguishes it as one of the most comprehensive benchmarks in this space.
Since its release, JailbreakBench has seen significant community adoption. The GitHub repository has accumulated over 550 stars and 65 forks. The JBB-Behaviors dataset on Hugging Face has been downloaded over 23,000 times. Multiple safety-focused models and tools, including PromptGuard, Arch-Guard, and tool-call-verifier, have been trained or evaluated using JailbreakBench data.
The benchmark's requirement that all submissions include their adversarial prompts has created a valuable resource for the AI safety community. Researchers studying jailbreak defenses can immediately access real attack examples to test their methods against, rather than having to re-implement attacks from paper descriptions.
JailbreakBench has also influenced the methodology of subsequent jailbreak research. Papers published after its release increasingly report their results on the JBB-Behaviors dataset and use the Llama 3 70B judge for evaluation, contributing to greater comparability across the field.
The authors acknowledge several limitations of JailbreakBench. The dataset of 100 harmful behaviors, while carefully curated, cannot cover every possible type of misuse. The behavior categories are aligned with OpenAI's usage policies, which may not capture all safety concerns relevant to other model providers or deployment contexts.
The jailbreak classifier, while achieving high agreement with human annotators, is not perfect. The 90.7% agreement rate means that roughly 1 in 10 classifications may differ from what a human expert would decide. This error rate can affect reported ASR numbers, particularly for attacks that produce borderline responses.
The benchmark primarily evaluates single-turn jailbreak attacks. Multi-turn attacks, where the attacker builds up context over several conversation turns before attempting the jailbreak, are an increasingly important threat vector that JailbreakBench does not fully address.
Finally, the leaderboard captures a snapshot of model behavior at a specific point in time. Closed-source models like GPT-4 receive ongoing safety updates, so attack success rates measured at one time may not reflect current model robustness.