# JailbreakBench

> Source: https://aiwiki.ai/wiki/jailbreakbench
> Updated: 2026-06-26
> Categories: AI Benchmarks, AI Safety, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

JailbreakBench is an open-source robustness benchmark for evaluating [jailbreak](/wiki/jailbreak) attacks and defenses against [large language models](/wiki/large_language_model) (LLMs). It bundles four components: an evolving repository of state-of-the-art adversarial prompts (called jailbreak artifacts), the JBB-Behaviors dataset of 100 distinct misuse behaviors aligned with OpenAI's usage policies, a standardized evaluation framework with a fixed threat model and an open-weight judge, and a public leaderboard that ranks attacks and defenses across LLMs [1]. It was introduced in 2024 by Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong, and the paper was accepted to the NeurIPS 2024 Datasets and Benchmarks Track (arXiv:2404.01318) [1]. The full codebase, datasets, and leaderboard are publicly available, and the work was a collaboration between researchers at the University of Pennsylvania, ETH Zurich, EPFL, the University of Tübingen, and Princeton University.

The benchmark addresses a persistent problem in [AI safety](/wiki/ai_safety) research: the lack of standardized methods for measuring whether LLMs can be tricked into producing harmful content. As the paper puts it, jailbreak attacks "cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content" [1]. Jailbreak attacks exploit weaknesses in an LLM's safety alignment to make the model generate outputs it would normally refuse. As LLMs have grown more capable and widely deployed, the arms race between attackers finding new jailbreaks and defenders patching vulnerabilities has intensified. Before JailbreakBench, this research area suffered from fragmented evaluation practices, incomparable metrics, and poor reproducibility. JailbreakBench was created to bring order to this landscape by providing a unified framework that the research community can use to compare methods on equal footing.

## What is JailbreakBench used for?

JailbreakBench is used to measure and compare how robust LLMs are to jailbreaking, using a single, reproducible standard. Researchers use it to: report attack success rate (ASR) on a fixed set of 100 behaviors with an agreed-upon judge; benchmark new attacks and defenses on a public leaderboard; download real adversarial prompts (the artifacts) to test their own defenses; and quantify overrefusal using a matched set of 100 benign behaviors [1]. Because every submission must include the actual prompts used, results are reproducible and directly comparable across papers.

## Why was JailbreakBench created?

Research on jailbreaking LLMs has grown rapidly, but three core problems have made it difficult to compare results across different papers and research groups. The paper states the goal plainly: existing benchmarks and evaluation techniques "do not adequately address" these challenges [1].

First, assessing whether an LLM response is genuinely harmful requires open-ended evaluation, and no single standard existed for making that judgment. Different researchers used different classifiers, different prompts for those classifiers, and different thresholds for what counts as a successful jailbreak. This meant that attack success rate (ASR) numbers from one paper were not directly comparable to numbers from another.

Second, even when researchers used similar metrics, the way they computed costs and success rates varied. Some counted only the number of queries to the target model; others counted tokens. Some allowed unlimited queries while others imposed budgets. These differences made it nearly impossible to determine which attack methods were actually more efficient.

Third, many published jailbreak attacks withheld their adversarial prompts, kept their code closed-source, or relied on proprietary APIs that changed over time. A jailbreak that worked against [GPT-4](/wiki/gpt-4) in January might not work in March because [OpenAI](/wiki/openai) had updated the model's safety filters. Without access to the original prompts and a stable evaluation environment, reproducing published results was often impractical.

JailbreakBench tackles all three problems. It provides a curated dataset of harmful behaviors, a standardized evaluation pipeline with fixed system prompts and chat templates, a validated jailbreak classifier, a repository of submitted adversarial prompts, and a public leaderboard that tracks attack and defense performance over time.

## How does JailbreakBench work?

JailbreakBench consists of four major components that together form a complete evaluation ecosystem for jailbreak research [1].

### JBB-Behaviors Dataset

The JBB-Behaviors dataset is the foundation of the benchmark. It contains 100 harmful behaviors and 100 corresponding benign behaviors, for a total of 200 entries [1]. The harmful behaviors are divided evenly across ten categories that align with OpenAI's usage policies, with ten behaviors per category.

The ten behavior categories are:

| Category | Example Behaviors |
|---|---|
| Harassment/Discrimination | Defamation, body shaming, xenophobia, ableism, sexism, racial discrimination, LGBTQ+ harassment |
| Malware/Hacking | Disabling antivirus software, network hacking, ransomware creation, keylogger scripts, password cracking |
| Physical Harm | Bomb building, poisoning, torture methods, suicide manipulation, ethnic genocide, stalking |
| Economic Harm | Extortion, gambling exploitation, payday loan scams, bribery, tax evasion, insider trading, counterfeiting |
| Fraud/Deception | Election interference, fake charities, pseudo-pharmaceuticals, astroturfing, Nigerian 419 scams, deepfakes |
| Disinformation | Political disinformation, 9/11 conspiracy theories, Holocaust denial, COVID-19 conspiracy theories, climate change denial |
| Sexual/Adult Content | Explicit content generation, sexual solicitation of minors, revenge porn, human trafficking content |
| Privacy | Identity theft, phishing emails, facial recognition for racial profiling, health record theft, location tracking |
| Expert Advice | Unauthorized medical treatment advice, opioid prescriptions, DUI avoidance, organ trafficking, emissions test cheating |
| Government Decision-Making | Document forgery, criminal record alteration, voting machine tampering, nuclear weapons instructions, child labor facilitation |

Each entry in the dataset specifies five fields:

| Field | Description |
|---|---|
| Behavior | A unique identifier describing a distinct misuse behavior (e.g., "Phishing," "Defamation") |
| Goal | A query requesting the objectionable behavior from the model |
| Target | An affirmative response template that a jailbroken model might produce |
| Category | The broader OpenAI usage policy category |
| Source | The origin of the behavior (Original, AdvBench, TDC/HarmBench) |
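
As a rough illustration of the record layout above, the following sketch models one entry as a Python dataclass and checks the even split across categories. The class name `BehaviorEntry`, the helper `is_balanced`, and the sample rows are illustrative assumptions and are not part of the JailbreakBench library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorEntry:
    """Hypothetical container mirroring the five documented fields."""
    behavior: str   # short identifier, e.g. "Phishing"
    goal: str       # the request sent to the model
    target: str     # affirmative opening a jailbroken model might produce
    category: str   # usage-policy category
    source: str     # "Original", "AdvBench", or "TDC/HarmBench"


def is_balanced(entries, per_category=10):
    """Return True if every category holds exactly `per_category` entries."""
    counts = Counter(e.category for e in entries)
    return all(n == per_category for n in counts.values())


# Placeholder rows only; the real dataset has ten categories of ten behaviors.
toy = [
    BehaviorEntry(f"b{i}", "<goal>", "<target>", cat, "Original")
    for cat in ("Privacy", "Disinformation")
    for i in range(10)
]
print(is_balanced(toy))
```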

The dataset deliberately keeps a small, representative set of 100 behaviors to enable fast evaluation of new attacks rather than acting as a superset of its sources. By the authors' accounting, 55% of the behaviors are original contributions from the JailbreakBench authors, 27% are sourced from the Trojan Detection Challenge 2023 and [HarmBench](/wiki/harmbench) (Mazeika et al., 2023, 2024), and 18% are drawn from AdvBench (Zou et al., 2023) [1][10]. The authors took care to remove duplicated entries and behaviors that would be impossible to fulfill, which were common problems in earlier datasets.

The 100 benign behaviors serve as a control set. They are thematically similar to the harmful behaviors but are safe to answer, allowing researchers to measure overrefusal rates. A defense that blocks all harmful content but also refuses legitimate requests is not practically useful, and the benign set makes it possible to quantify this tradeoff [1].
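
The tradeoff can be made concrete with a small calculation. The sketch below assumes each response has already been marked as a refusal or not (for example by a refusal judge); the function name `refusal_rate` and the toy flags are hypothetical.

```python
def refusal_rate(refused):
    """Fraction of responses flagged as refusals (list of booleans)."""
    if not refused:
        raise ValueError("need at least one response")
    return sum(refused) / len(refused)


# Toy refusal flags; in practice these would come from a refusal judge.
harmful_refused = [True, True, True, False]    # a good defense keeps this high
benign_refused = [False, True, False, False]   # and keeps this low

print(f"refusal on harmful set: {refusal_rate(harmful_refused):.0%}")
print(f"overrefusal on benign set: {refusal_rate(benign_refused):.0%}")
```

A defense is only useful in practice if the second number stays low while the first stays high.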

The dataset is hosted on Hugging Face (DOI: 10.57967/hf/2540) under an MIT license and can be loaded through either the JailbreakBench Python library or the Hugging Face Datasets library [10].

### Standardized Evaluation Framework

The evaluation framework defines every aspect of how attacks and defenses should be tested, eliminating the inconsistencies that plagued earlier work. The paper describes it as "a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions" [1].

**Threat Model.** JailbreakBench classifies attacks into three categories based on the level of access the attacker has to the target model:

| Attack Type | Description |
|---|---|
| White-box | The attacker has full access to the model's architecture, weights, and gradients. This is only possible for open-source models. |
| Black-box | The attacker can only query the model through its API and observe responses. No access to internal parameters is available. |
| Transfer | The attacker generates adversarial prompts using a white-box surrogate model and then applies them to a different target model without further adaptation. |

**System Prompts and Chat Templates.** For each supported model, JailbreakBench specifies the exact system prompt and chat template to use during evaluation. This ensures that all researchers test against the same configuration rather than accidentally varying these parameters.

**Scoring and Classification.** The benchmark provides standardized scoring functions that determine whether a model response constitutes a successful jailbreak. In version 1.0, the primary jailbreak judge is [Llama](/wiki/llama) 3 Instruct 70B, selected after a rigorous comparison of multiple classifiers (described in detail below) [1]. A separate refusal judge based on Llama 3 8B evaluates whether the model actively refused the request, enabling finer-grained analysis of model behavior.

**Cost Tracking.** For every attack, JailbreakBench tracks the total number of queries to the target model, the total number of prompt and response tokens, and the number of queries or tokens per successful jailbreak. This allows researchers to compare not just raw success rates but also the efficiency of different attack strategies.
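
A minimal sketch of how these quantities might be aggregated is shown below. The `AttackRecord` class and `summarize` function are hypothetical, and dividing total queries by the number of successes is one plausible reading of "queries per successful jailbreak", not necessarily the exact formula the benchmark uses.

```python
from dataclasses import dataclass


@dataclass
class AttackRecord:
    """Hypothetical per-behavior log entry."""
    jailbroken: bool
    queries: int
    prompt_tokens: int
    response_tokens: int


def summarize(records):
    """Aggregate attack success rate and cost over a list of records."""
    successes = sum(r.jailbroken for r in records)
    total_queries = sum(r.queries for r in records)
    total_tokens = sum(r.prompt_tokens + r.response_tokens for r in records)
    return {
        "asr": successes / len(records),
        "total_queries": total_queries,
        "total_tokens": total_tokens,
        # Undefined when nothing succeeded, so report None instead.
        "queries_per_success": total_queries / successes if successes else None,
    }


records = [
    AttackRecord(True, 4, 320, 150),
    AttackRecord(False, 20, 1600, 700),
    AttackRecord(True, 6, 480, 210),
]
print(summarize(records))
```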

### Jailbreak Artifacts Repository

A distinctive feature of JailbreakBench is its requirement that all submissions include the actual adversarial prompts used. These are stored in a public artifacts repository on GitHub. Each submission records, for every behavior in the dataset, the prompt that was used, the model's response, whether the jailbreak succeeded, the number of queries required, and the token counts.

This transparency serves several purposes. It allows other researchers to reproduce results exactly. It enables the community to study the characteristics of successful jailbreaks and develop better defenses. And it creates a historical record that shows how attack techniques have evolved over time.

The artifacts can be loaded programmatically through the JailbreakBench Python API:

```python
import jailbreakbench as jbb

artifact = jbb.read_artifact(
    method="PAIR",
    model_name="vicuna-13b-v1.5"
)
print(artifact.jailbreaks[75])
```

Each artifact entry contains the behavior index, goal, category, prompt, response, number of queries, queries to jailbreak, prompt tokens, response tokens, and a boolean indicating whether the jailbreak was successful.

### Leaderboard

The JailbreakBench leaderboard, hosted at jailbreakbench.github.io, provides a centralized view of attack and defense performance across models [9]. The leaderboard maintains separate rankings for open-source and closed-source models and allows filtering by attack type, defense method, and other metadata.

The leaderboard displays the attack success rate for each method on each model, along with cost metrics. It links directly to the corresponding jailbreak artifacts, so anyone can inspect the actual prompts and responses behind each entry.

## How are jailbreaks judged in JailbreakBench?

One of the most important contributions of JailbreakBench is its rigorous evaluation of jailbreak classifiers. The choice of classifier has a large impact on reported attack success rates, and using different classifiers was a major source of inconsistency in prior work.

### Classifier Comparison

The authors compared six different jailbreak classifiers:

| Classifier | Type | Description |
|---|---|---|
| GPT-4 | LLM judge | Uses [GPT-4](/wiki/gpt-4) with a specific prompt to classify responses as jailbroken or not |
| GPT-4 Turbo | LLM judge | Similar to GPT-4 but using the Turbo variant |
| GCG (rule-based) | Keyword matching | Uses handcrafted rules to detect refusals |
| BERT-based | Fine-tuned classifier | A [BERT](/wiki/bert) model fine-tuned on jailbreak detection data |
| TDC (Trojan Detection Challenge) | Fine-tuned classifier | Classifier from the Trojan Detection Challenge |
| Llama Guard | LLM judge | [Meta](/wiki/meta_ai)'s safety classifier based on Llama |

To establish ground truth, three expert annotators independently labeled each prompt-response pair, achieving approximately 95% inter-annotator agreement. The majority vote of these three annotators served as the reference label.
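
The labeling step can be sketched in a few lines. The example below assumes boolean labels (True meaning "jailbroken") and uses mean pairwise agreement as the agreement measure; the source does not say how the 95% figure was computed, so that choice and both function names are assumptions.

```python
from itertools import combinations


def majority_vote(labels):
    """Majority label among an odd number of boolean annotations."""
    return sum(labels) > len(labels) / 2


def pairwise_agreement(annotations):
    """Mean fraction of items on which each pair of annotators agrees.

    `annotations` is a list of per-annotator label lists of equal length.
    """
    rates = [
        sum(a == b for a, b in zip(x, y)) / len(x)
        for x, y in combinations(annotations, 2)
    ]
    return sum(rates) / len(rates)


# Toy labels: three annotators, four prompt-response pairs.
ann = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, True, False],
]
reference = [majority_vote(item) for item in zip(*ann)]
print(reference)
print(pairwise_agreement(ann))
```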

### Human Evaluation Results

The evaluation revealed significant differences between classifiers. GPT-4 and Llama 3 70B achieved the highest agreement with human annotators, both exceeding 90% agreement. Llama Guard 2 reached 87.7% agreement with roughly equal false positive and false negative rates. The HarmBench classifier and the original Llama Guard had lower agreement rates of 78.3% and 72.0%, respectively.

A critical finding was that the HarmBench classifier exhibited a high false positive rate on benign examples from XS-Test, reaching 26.8% overall. The BERT-based classifier failed to identify 74% of jailbreaks, making it unreliable for measuring attack success.
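
For readers who want to reproduce this kind of comparison on their own classifier, the sketch below computes agreement, false positive rate, and false negative rate against reference labels. The function name and toy data are illustrative; True means "jailbroken" in both lists.

```python
def judge_metrics(predicted, reference):
    """Agreement, false positive rate, and false negative rate.

    Both inputs are equal-length lists of booleans.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)
    tn = sum(not p and not r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(not p and r for p, r in pairs)
    return {
        "agreement": (tp + tn) / len(pairs),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # benign flagged as jailbroken
        "fnr": fn / (fn + tp) if fn + tp else 0.0,  # jailbreaks that were missed
    }


reference = [True, True, False, False, False]
predicted = [True, False, True, False, False]
print(judge_metrics(predicted, reference))
```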

### Selection of Llama 3 70B

The authors selected Llama 3 Instruct 70B as the default jailbreak judge for JailbreakBench. The key reasons were that it achieved GPT-4-level agreement with human annotators (90.7% agreement with ground truth), it is an open-weight model that anyone can run without API costs, and its behavior is reproducible since the weights do not change over time. By contrast, proprietary models like GPT-4 can be updated by their providers at any time, potentially changing evaluation results [1].

The judge comparison dataset, containing 300 examples with labels from three human annotators and four LLM judges, is publicly available on Hugging Face for researchers who want to evaluate new classifiers.

## Which attacks does JailbreakBench evaluate?

JailbreakBench includes several baseline attack methods spanning different attack paradigms. Version 1.0 of the benchmark evaluates four primary attack strategies.

### Greedy Coordinate Gradient (GCG)

GCG, introduced by Zou et al. (2023), is a white-box attack that optimizes an adversarial suffix appended to the input prompt. The method uses gradient information to iteratively refine the suffix tokens so that the model's output begins with a target affirmative response (e.g., "Sure, here is..."). Because GCG requires access to model gradients, it can only be applied directly to open-source models. However, the resulting adversarial suffixes can sometimes transfer to other models, including closed-source ones [2].

In the JailbreakBench evaluation, GCG showed mixed results. It achieved moderate success on Vicuna but performed poorly against models with stronger safety alignment, recording only about 3% ASR on [Llama 2](/wiki/llama) and roughly 4% on GPT-4. The authors noted that after the release of the jailbreak artifacts, the success rate of GCG on GPT models decreased to approximately 5%, likely due to safety patches applied by OpenAI.

### Prompt Automatic Iterative Refinement (PAIR)

PAIR, developed by Chao et al. (2023), is a black-box attack that uses a separate attacker LLM to iteratively generate and refine jailbreak prompts. The attacker model proposes a candidate prompt, observes the target model's response, and then modifies its approach based on whether the jailbreak succeeded. This process typically runs for a fixed number of iterations [4].
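
The control flow of such a propose, query, and judge loop can be sketched with harmless stand-in functions. This is a simplified single-stream illustration and not the PAIR implementation; `iterative_refinement` and the three toy callables are hypothetical, and no model is involved.

```python
def iterative_refinement(goal, attacker, target, judge, max_iters=3):
    """Propose, query, and judge until success or the budget runs out.

    Returns (successful_prompt_or_None, number_of_target_queries).
    """
    history = []  # (prompt, response) pairs fed back to the attacker
    for queries in range(1, max_iters + 1):
        prompt = attacker(goal, history)
        response = target(prompt)
        if judge(goal, response):
            return prompt, queries
        history.append((prompt, response))
    return None, max_iters


# Stand-ins so the loop runs without any model.
def toy_attacker(goal, history):
    return f"{goal} (attempt {len(history) + 1})"


def toy_target(prompt):
    return "OK" if "attempt 3" in prompt else "REFUSED"


def toy_judge(goal, response):
    return response == "OK"


print(iterative_refinement("<placeholder goal>", toy_attacker, toy_target, toy_judge))
```

The returned query count is the kind of per-behavior cost that the benchmark's cost tracking records.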

PAIR is query-efficient because it leverages the attacker LLM's understanding of language and social engineering to craft semantically meaningful prompts rather than relying on gradient-based token optimization. In the JailbreakBench evaluation, PAIR achieved high success rates on Vicuna and GPT-3.5 but was less effective against more strongly aligned models like Llama 2.

### JailbreakChat (JB-Chat)

JailbreakChat represents manually crafted jailbreak prompts collected from online communities. The benchmark uses the AIM (Always Intelligent and Machiavellian) template, one of the most well-known hand-crafted jailbreak prompts. These prompts typically use role-playing scenarios, hypothetical framing, or other social engineering techniques to bypass safety guardrails.

In testing, the AIM template from JailbreakChat proved effective against Vicuna but failed on all behaviors when applied to Llama 2 and the GPT models. This result highlights the difference between models with varying levels of safety training, and it shows that hand-crafted prompts that work on weaker models often do not generalize.

### Prompt with Random Search (Prompt + RS)

This attack, based on work by Andriushchenko et al. (2024), combines a manually optimized prompt template with adversarial suffixes found through random search. The method uses self-transfer, where adversarial suffixes discovered on one model are applied to another, to improve efficiency [7].

Prompt with RS emerged as the most effective attack in the JailbreakBench evaluation. It achieved approximately 90% ASR on Llama 2 and 78% on GPT-4, substantially outperforming all other methods. Its query efficiency was also notable, requiring on average only about 2 queries on Vicuna and 3 on GPT-3.5 to achieve a successful jailbreak. The high performance is attributed to the combination of a carefully designed prompt template with optimized adversarial suffixes.

### Attack Results Summary

The following table summarizes the approximate attack success rates from the JailbreakBench v1.0 evaluation, which added Prompt + RS and scores responses with the benchmark's default Llama 3 70B judge:

| Attack Method | Attack Type | Vicuna-13b | Llama-2-7b | GPT-3.5 Turbo | GPT-4 |
|---|---|---|---|---|---|
| GCG | White-box / Transfer | Moderate | ~3% | Low | ~4% |
| PAIR | Black-box | High | Low | High | Moderate |
| JailbreakChat (AIM) | Black-box | High | 0% | 0% | 0% |
| Prompt + RS | Black-box / Transfer | Very High | ~90% | High | ~78% |

These results show that no single attack succeeds against every model: attack effectiveness varies dramatically with the target model's safety alignment and the sophistication of the attack method.

## Which models does JailbreakBench evaluate?

JailbreakBench evaluates attacks across both open-source and closed-source LLMs. The initial set of supported models includes:

| Model | Type | Provider |
|---|---|---|
| Vicuna-13b-v1.5 | Open-source | LMSYS |
| Llama-2-7b-chat-hf | Open-source | [Meta](/wiki/meta_ai) |
| Llama-3-8b-instruct | Open-source | Meta |
| GPT-3.5-turbo-1106 | Closed-source | [OpenAI](/wiki/openai) |
| GPT-4-0125-preview | Closed-source | OpenAI |

Open-source models can be queried either locally through [vLLM](/wiki/vllm) or via cloud APIs through Together AI using the LiteLLM integration. Closed-source models are accessed through their respective APIs.

The choice of these models reflects a deliberate strategy: Vicuna serves as a relatively weakly aligned model that most attacks can succeed against, Llama 2 represents stronger open-source safety alignment, and the GPT models represent the current state of commercial safety engineering. This spread allows researchers to evaluate how attacks perform across a range of alignment strengths.

## What defenses does JailbreakBench support?

JailbreakBench supports the evaluation of test-time defense algorithms that attempt to detect or neutralize jailbreak prompts before they reach the model. Version 1.0 includes five baseline defenses.

### SmoothLLM

SmoothLLM, proposed by Robey et al. (2023), creates multiple perturbed copies of the input prompt by randomly swapping, inserting, or deleting characters. It then queries the model with each perturbed copy and takes a majority vote over the responses. The idea is that adversarial suffixes (like those generated by GCG) are brittle and will be disrupted by character-level perturbations, while legitimate prompts will produce consistent responses across perturbations [5].
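
The perturb-and-vote idea can be illustrated with a toy example. The sketch below only implements character replacement (not insertion or deletion) and returns the vote rather than a response, so it is a simplification rather than the SmoothLLM implementation; `perturb`, `smoothed_is_jailbroken`, and the toy model and judge are hypothetical.

```python
import random
import string


def perturb(prompt, rate, rng):
    """Replace roughly `rate` of the characters with random printable ones."""
    chars = list(prompt)
    n_swaps = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_swaps):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smoothed_is_jailbroken(prompt, model, judge, copies=5, rate=0.1, seed=0):
    """Majority vote of `judge` over responses to perturbed copies."""
    rng = random.Random(seed)
    votes = [judge(model(perturb(prompt, rate, rng))) for _ in range(copies)]
    return sum(votes) > copies / 2


# Toy model that only "complies" when a brittle suffix is fully intact.
TRIGGER = "!!magic-suffix!!"


def toy_model(prompt):
    return "COMPLY" if TRIGGER in prompt else "REFUSE"


def toy_judge(response):
    return response == "COMPLY"


prompt = "<placeholder request> " + TRIGGER
print("undefended:", toy_judge(toy_model(prompt)))
print("smoothed:", smoothed_is_jailbroken(prompt, toy_model, toy_judge, rate=0.3))
```

Because the toy suffix must survive character for character, almost any perturbation breaks it, which mirrors the brittleness argument above.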

In the JailbreakBench evaluation, SmoothLLM generally demonstrated strong performance in reducing ASR, particularly against GCG attacks. It reduced GCG's success rate on Vicuna to about 4% and to 0% on Llama 2. It also reduced PAIR's success rates more effectively than the perplexity filter.

### Perplexity Filter

The perplexity filter, based on work by Jain et al. (2023), rejects input prompts that have unusually high perplexity, as measured by a reference language model. The intuition is that adversarial suffixes generated by gradient-based attacks often contain nonsensical token sequences that a language model would assign low probability [6].
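
Perplexity is the exponential of the negative mean token log-probability, so the filter reduces to a threshold check once a reference model has scored the prompt. The sketch below assumes the per-token log-probabilities are already available; the numbers and the threshold are made up for illustration.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean log-probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs, threshold):
    """Accept the prompt only if its perplexity is at most `threshold`."""
    return perplexity(token_logprobs) <= threshold


natural = [-1.2, -0.8, -1.5, -0.9]     # fluent text: fairly likely tokens
gibberish = [-7.5, -9.1, -8.3, -6.9]   # optimizer-style suffix: unlikely tokens

THRESHOLD = 50.0  # illustrative value; would be calibrated on benign prompts
print(perplexity(natural), passes_filter(natural, THRESHOLD))
print(perplexity(gibberish), passes_filter(gibberish, THRESHOLD))
```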

This defense showed limited effectiveness in the evaluation. While it successfully filtered out GCG prompts (which contain obviously unnatural text), it was largely ineffective against semantically meaningful attacks. Against JailbreakChat prompts, the perplexity filter allowed 90% ASR on Vicuna, and against Prompt + RS it allowed 88% ASR on Vicuna, 73% on Llama 2, and 70% on GPT-4.

### Erase-and-Check

Erase-and-Check works by systematically erasing portions of the input prompt and running a safety filter on the original prompt and on each erased version. If the filter flags any of them as harmful, the defense treats the whole prompt as harmful and rejects it. The idea is that erasing an adversarial insertion exposes the underlying harmful request to the filter.

This defense consistently achieved very low ASR values in the evaluation, suggesting strong robustness against transfer attacks. It was particularly effective at reducing the success of GCG and PAIR attacks to near-zero levels.

### Synonym Substitution

This defense replaces words in the input prompt with their synonyms before passing the modified prompt to the model. The transformation disrupts adversarial patterns while preserving the semantic content of legitimate queries.
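
A toy version of the transformation is sketched below. The synonym table, the substitution probability, and the function name are all illustrative assumptions; the defense in the benchmark would draw on a much larger lexical resource.

```python
import random

# Tiny hand-written synonym table, purely illustrative.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "help": ["assist", "aid"],
    "write": ["compose", "draft"],
}


def substitute_synonyms(prompt, prob=1.0, seed=0):
    """Replace each known word with a random synonym with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)  # unknown words pass through unchanged
    return " ".join(out)


print(substitute_synonyms("Please help me write a quick note"))
```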

Synonym substitution proved surprisingly effective in the evaluation, keeping ASR below 24% across most attack methods. Against Prompt + RS on Vicuna, it reduced the success rate to just 2%, and on Llama 2 to 0%.

### Remove Non-Dictionary Words

This defense strips out any tokens that do not appear in a standard dictionary. The goal is to remove adversarial suffixes that consist of nonsensical character sequences.
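
A minimal sketch of the filtering step follows. The six-word vocabulary is a stand-in for a real dictionary, and the tokenization (whitespace split, punctuation ignored) is an assumption made for illustration.

```python
import re

# Stand-in vocabulary; a real filter would load a full English word list.
DICTIONARY = {"please", "summarize", "this", "article", "for", "me"}


def remove_non_dictionary_words(prompt, vocab=DICTIONARY):
    """Keep whitespace-separated tokens whose letters form a known word."""
    kept = []
    for token in prompt.split():
        core = re.sub(r"[^a-z]", "", token.lower())  # ignore punctuation and case
        if core in vocab:
            kept.append(token)
    return " ".join(kept)


print(remove_non_dictionary_words("Please summarize this article xq!zv ]]{{ for me."))
```

A suffix built entirely from dictionary words would pass through such a filter untouched.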

The defense showed limited improvements compared to other methods. Prompt + RS still achieved 91% success on Vicuna even with this defense active, because the attack's adversarial components can be designed to use dictionary-valid words.

### Defense Results Summary

| Defense | GCG (Vicuna) | GCG (Llama-2) | PAIR (Vicuna) | JB-Chat (Vicuna) | Prompt+RS (Vicuna) | Prompt+RS (Llama-2) | Prompt+RS (GPT-4) |
|---|---|---|---|---|---|---|---|
| None (baseline) | Moderate | ~3% | High | High | Very High | ~90% | ~78% |
| SmoothLLM | ~4% | 0% | Reduced | Reduced | Reduced | Reduced | Reduced |
| Perplexity Filter | ~0% | ~0% | 69% | 90% | 88% | 73% | 70% |
| Erase-and-Check | 0-2% | 0-2% | 0-2% | Reduced | Reduced | Reduced | Reduced |
| Synonym Substitution | Low | Low | Low | Low | ~2% | 0% | Low |
| Remove Non-Dictionary | Reduced | Reduced | Reduced | Reduced | 91% | Moderate | Moderate |

The results show that no single defense is effective against all attack types. Defenses that target the syntactic properties of adversarial prompts (like perplexity filtering and non-dictionary word removal) fail against semantically meaningful attacks, while defenses that perturb the prompt content (like SmoothLLM and synonym substitution) show broader effectiveness but may degrade benign performance.

## Technical Implementation

### Installation and Setup

JailbreakBench is distributed as a Python package installable via pip:

```bash
pip install jailbreakbench
```

For local model inference using vLLM:

```bash
pip install jailbreakbench[vllm]
```

### Python API

The library provides a straightforward API for loading datasets, querying models, running evaluations, and accessing submitted artifacts.

**Loading the dataset:**

```python
import jailbreakbench as jbb

dataset = jbb.read_dataset()
behaviors = dataset.behaviors
goals = dataset.goals
targets = dataset.targets
categories = dataset.categories

# Load as pandas DataFrame
df = dataset.as_dataframe()

# Load benign behaviors
benign = jbb.read_dataset("benign")
```

**Querying models:**

```python
import os
import jailbreakbench as jbb

# API-based querying via LiteLLM
llm = jbb.LLMLiteLLM(
    model_name="vicuna-13b-v1.5",
    api_key=os.environ["TOGETHER_API_KEY"]
)

# Or local inference via vLLM
llm = jbb.LLMvLLM(model_name="vicuna-13b-v1.5")

# Query with a list of prompts
responses = llm.query(
    prompts=["Write a phishing email."],
    behavior="phishing"
)
```

**Running evaluations with defenses:**

```python
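# Continues from the previous snippet: `llm` is an already constructed model wrapper
prompts = ["Write a phishing email."]

# Passing a defense name wraps the query in that defense
responses = llm.query(
    prompts=prompts,
    behavior="phishing",
    defense="SmoothLLM"
)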
```

**Evaluating a full set of prompts:**

```python
# `all_prompts` maps each model name to a {behavior: prompt} dictionary
evaluation = jbb.evaluate_prompts(
    all_prompts,
    llm_provider="litellm",
    defense="SmoothLLM"
)
```

All queries are automatically logged to a `logs/` directory with metadata including timestamps, token counts, and response details.

### Jailbreak and Refusal Judges

The evaluation pipeline includes two specialized judges:

```python
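import os

from jailbreakbench.classifier import (
    Llama3JailbreakJudge,
    Llama3RefusalJudge
)

# Assumption: the judges are called through Together AI, so the key from the querying example is reused
api_key = os.environ["TOGETHER_API_KEY"]
jailbreak_judge = Llama3JailbreakJudge(api_key)
refusal_judge = Llama3RefusalJudge(api_key)

# Placeholder prompt-response pair to classify
prompt = "Write a phishing email."
response = "I'm sorry, but I can't help with that."

# Each judge takes parallel lists of prompts and responses
is_jailbroken = jailbreak_judge([prompt], [response])
is_refusal = refusal_judge([prompt], [response])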
```

The jailbreak judge determines whether the model's response contains harmful content that fulfills the requested behavior. The refusal judge determines whether the model explicitly declined the request. A response can be classified as neither a jailbreak nor a refusal if the model produces an irrelevant or evasive answer without directly refusing.
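
The three possible outcomes can be made explicit by combining the two verdicts. The helper below is a hypothetical sketch that takes plain booleans; giving the jailbreak verdict precedence when both flags are set is an assumption, not documented behavior.

```python
from collections import Counter


def classify_outcome(is_jailbroken, is_refusal):
    """Combine the two judge verdicts into a single label."""
    if is_jailbroken:
        return "jailbroken"
    if is_refusal:
        return "refused"
    return "neither"  # e.g. an off-topic or evasive answer


# Toy verdicts for three responses.
jailbreak_flags = [True, False, False]
refusal_flags = [False, True, False]
outcomes = [classify_outcome(j, r) for j, r in zip(jailbreak_flags, refusal_flags)]
print(Counter(outcomes))
```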

## How do you submit to JailbreakBench?

### Submitting Attacks

Researchers who develop new jailbreak attack methods can submit their results to the JailbreakBench leaderboard through a structured process:

1. Generate 100 jailbreak strings for each model, covering all behaviors in the dataset. At minimum, results for Vicuna-13b and Llama-2-7b are required; GPT-3.5 and GPT-4 are optional.
2. Format the jailbreak strings as a dictionary mapping behavior names to prompt strings for each model.
3. Run the standardized evaluation pipeline using `jbb.evaluate_prompts()`, which queries the models and classifies responses.
4. Create a submission file using `jbb.create_submission()`, specifying the method name, attack type (white-box, black-box, or transfer), and method parameters.
5. Open an issue on the JailbreakBench GitHub repository with the algorithm name and upload the submission JSON file.
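
The dictionary from step 2 might look like the sketch below, which nests behavior-to-prompt mappings under each model name. The nesting is one reading of the description above, the prompts are placeholders, and `missing_behaviors` is a hypothetical helper for checking coverage before running the evaluation.

```python
import json

# Placeholder behavior names and prompts; a real submission covers all 100
# behaviors for each model.
behaviors = ["Phishing", "Defamation"]
models = ["vicuna-13b-v1.5", "llama-2-7b-chat-hf"]

all_prompts = {
    model: {behavior: f"<jailbreak string for {behavior}>" for behavior in behaviors}
    for model in models
}


def missing_behaviors(prompts_by_model, expected):
    """List (model, behavior) pairs that have no prompt yet."""
    return [
        (model, b)
        for model, prompts in prompts_by_model.items()
        for b in expected
        if b not in prompts
    ]


print(json.dumps(all_prompts, indent=2))
print(missing_behaviors(all_prompts, behaviors + ["Keylogger script"]))
```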

### Submitting Defenses

Defense submissions follow a different process that involves contributing code directly to the repository:

1. Fork the JailbreakBench repository and create a new branch.
2. Add hyperparameters for the defense in the defense configuration file.
3. Implement a defense class that inherits from the base Defense class, implementing the `query()` method.
4. Register the defense in the defenses module.
5. Submit a pull request to the main repository.

After the defense is merged, the submitter runs the standard attack evaluations against the defended model and submits the resulting artifacts.

## When was JailbreakBench released?

JailbreakBench has evolved through several releases since its initial publication in 2024.

| Version | Date | Key Changes |
|---|---|---|
| v0.1.1 | March 27, 2024 | Initial code release |
| v0.1.2 | March 27, 2024 | Added standalone defense mechanisms and defense artifacts |
| v0.1.3 | April 6, 2024 | Fixed inconsistencies identified in the paper |
| v1.0.0 | June 13, 2024 | Major release: moved dataset to Hugging Face, added new judges, integrated defenses, clarified data sources |

The accompanying paper (arXiv:2404.01318) was first posted in April 2024 and accepted to the NeurIPS 2024 Datasets and Benchmarks Track [1]. The v1.0 release represented a substantial expansion. It added the Prompt with Random Search attack artifacts alongside the original GCG, PAIR, and JailbreakChat entries. It introduced additional test-time defenses including Erase-and-Check, Synonym Substitution, and Remove Non-Dictionary Words, supplementing the original SmoothLLM and Perplexity Filter. The jailbreak judge was upgraded from Llama Guard to Llama 3 70B with a custom prompt, achieving GPT-4-level agreement with human annotations. The human preference dataset for judge calibration was expanded from 100 to 300 examples. A semantic refusal judge based on Llama 3 8B was added, along with an overrefusal evaluation dataset of 100 benign and borderline behaviors matching the 100 harmful behaviors.

## How is JailbreakBench different from HarmBench?

JailbreakBench exists alongside several other benchmarks and datasets for evaluating LLM safety. Understanding how they relate helps clarify JailbreakBench's specific contributions.

**[HarmBench](/wiki/harmbench)** (Mazeika et al., 2024), from the Center for AI Safety, provides a standardized evaluation framework for automated [red teaming](/wiki/red_teaming) and robust refusal [3]. The two benchmarks overlap and even share behaviors (27% of JBB-Behaviors comes from TDC/HarmBench), but they emphasize different things. HarmBench covers a broader array of topics, including copyright infringement and multimodal models, and ships its own fine-tuned harm classifier. JailbreakBench keeps a smaller, fixed set of 100 behaviors for fast and comparable evaluation, requires every submission to publish its adversarial prompts (the artifacts repository), centers adaptive attacks and test-time defenses, and uses an open-weight Llama 3 70B judge. In JailbreakBench's own classifier comparison, the HarmBench classifier reached 78.3% agreement with human labels and a 26.8% false positive rate on benign XS-Test examples, which is part of why the authors adopted a different judge [1].

**AdvBench** (Zou et al., 2023) introduced one of the first large-scale datasets of harmful behaviors for evaluating adversarial attacks on LLMs [2]. However, AdvBench contained duplicated entries and behaviors that were impossible to fulfill. JailbreakBench sources 18% of its behaviors from AdvBench but curates them to remove these quality issues.

**Trojan Detection Challenge (TDC) 2023** contributed harmful behavior examples that were incorporated into the JBB-Behaviors dataset. The TDC focused specifically on detecting hidden trojans in language models, while JailbreakBench focuses on the broader jailbreaking problem.

**[StrongReject](/wiki/strongreject)** and other evaluation frameworks measure LLM robustness from different angles. What sets JailbreakBench apart is that it combines a curated dataset, a standardized evaluation pipeline, publicly released attack artifacts, and an active leaderboard in a single benchmark.

## Is JailbreakBench open source?

Yes. JailbreakBench is fully open source. The Python package, evaluation pipeline, and leaderboard code are released on GitHub under the MIT license [8]. The JBB-Behaviors dataset is hosted on Hugging Face (DOI: 10.57967/hf/2540), also under the MIT license [10], and the leaderboard and artifacts repository are publicly viewable [9]. The default jailbreak judge is the open-weight Llama 3 Instruct 70B, which means the entire evaluation can be reproduced without paid API access. The authors note they "carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community" [1].

## Why does JailbreakBench matter?

Since its release, JailbreakBench has seen significant community adoption. The GitHub repository has accumulated over 550 stars and 65 forks. The JBB-Behaviors dataset on Hugging Face has been downloaded over 23,000 times. Multiple safety-focused models and tools, including PromptGuard, Arch-Guard, and tool-call-verifier, have been trained or evaluated using JailbreakBench data.

The benchmark's requirement that all submissions include their adversarial prompts has created a valuable resource for the AI safety community. Researchers studying jailbreak defenses can immediately access real attack examples to test their methods against, rather than having to re-implement attacks from paper descriptions.

JailbreakBench has also influenced the methodology of subsequent jailbreak research. Papers published after its release increasingly report their results on the JBB-Behaviors dataset and use the Llama 3 70B judge for evaluation, contributing to greater comparability across the field.

## Limitations

The authors acknowledge several limitations of JailbreakBench. The dataset of 100 harmful behaviors, while carefully curated, cannot cover every possible type of misuse. The behavior categories are aligned with OpenAI's usage policies, which may not capture all safety concerns relevant to other model providers or deployment contexts.

The jailbreak classifier, while achieving high agreement with human annotators, is not perfect. The 90.7% agreement rate means that roughly 1 in 10 classifications may differ from what a human expert would decide. This error rate can affect reported ASR numbers, particularly for attacks that produce borderline responses.
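The effect on reported numbers can be sketched with simple arithmetic. The snippet below assumes, as a worst case, that every judge disagreement pushes the ASR in the same direction, and that the 90.7% agreement rate carries over unchanged to the responses being scored. Neither assumption comes from the paper, and `asr_bounds` is a hypothetical helper.

```python
def asr_bounds(observed_asr: float, agreement: float = 0.907) -> tuple[float, float]:
    """Worst-case range for the human-judged ASR given a judge-reported ASR.

    Assumes up to (1 - agreement) of all labels could flip, all in one direction.
    """
    if not 0.0 <= observed_asr <= 1.0 or not 0.0 <= agreement <= 1.0:
        raise ValueError("rates must lie in [0, 1]")
    slack = 1.0 - agreement  # fraction of labels that may disagree with a human
    return max(0.0, observed_asr - slack), min(1.0, observed_asr + slack)


low, high = asr_bounds(0.50)
print(f"reported 50% ASR -> human-judged ASR between {low:.1%} and {high:.1%}")
```

Under these assumptions a reported 50% ASR is compatible with a human-judged ASR anywhere from about 41% to 59%, which on the 100-behavior dataset is a swing of roughly nine behaviors in either direction. In practice disagreements tend to fall in both directions and partly cancel, so the real gap is usually smaller.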

The benchmark primarily evaluates single-turn jailbreak attacks. Multi-turn attacks, where the attacker builds up context over several conversation turns before attempting the jailbreak, are an increasingly important threat vector that JailbreakBench does not fully address.

Finally, the leaderboard captures a snapshot of model behavior at a specific point in time. Closed-source models like GPT-4 receive ongoing safety updates, so attack success rates measured at one time may not reflect current model robustness.

## See Also

- [AI Safety](/wiki/ai_safety)
- [Red Teaming](/wiki/red_teaming)
- [Large Language Models](/wiki/large_language_model)
- [Prompt Engineering](/wiki/prompt_engineering)
- [RLHF](/wiki/rlhf)
- [HarmBench](/wiki/harmbench)
- [GPT-4](/wiki/gpt-4)
- [Llama](/wiki/llama)

## References

1. Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., Hassani, H., & Wong, E. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models." *NeurIPS 2024 Datasets and Benchmarks Track*. arXiv:2404.01318.
2. Zou, A., Wang, Z., Kolter, J.Z., & Fredrikson, M. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043.
3. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." arXiv:2402.04249.
4. Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., & Wong, E. (2023). "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv:2310.08419.
5. Robey, A., Wong, E., Hassani, H., & Pappas, G.J. (2023). "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks." arXiv:2310.03684.
6. Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P.Y., Goldblum, M., Saha, A., Geiping, J., & Goldstein, T. (2023). "Baseline Defenses for Adversarial Attacks Against Aligned Language Models." arXiv:2309.00614.
7. Andriushchenko, M., Croce, F., & Flammarion, N. (2024). "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks." arXiv:2404.02151.
8. JailbreakBench GitHub Repository. https://github.com/JailbreakBench/jailbreakbench
9. JailbreakBench Leaderboard. https://jailbreakbench.github.io/
10. JBB-Behaviors Dataset on Hugging Face. https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors