HarmBench is a standardized evaluation framework for automated red teaming and robust refusal of large language models (LLMs). Developed by researchers at the Center for AI Safety (CAIS) and the University of Illinois at Urbana-Champaign, HarmBench provides a systematic approach to evaluating both attack methods that attempt to elicit harmful outputs from LLMs and the defensive measures designed to prevent them. The framework includes 510 carefully curated harmful behaviors, 18 red teaming attack methods, and a standardized evaluation pipeline that enables fair comparison across methods. HarmBench was presented at the 41st International Conference on Machine Learning (ICML) in July 2024 and is openly available under the MIT license.
Automated red teaming has become an important area of AI safety research as large language models have grown more capable and widely deployed. The goal of red teaming is to identify vulnerabilities in LLMs by generating inputs that cause models to produce harmful, dangerous, or policy-violating outputs. Before HarmBench, the field lacked a unified framework for assessing red teaming methods. Individual papers introduced their own sets of harmful behaviors, used different evaluation metrics, and tested on varying subsets of models under different conditions. This made it difficult, if not impossible, to compare the effectiveness of different attack strategies or to measure real progress in defensive robustness.
The authors of HarmBench identified three specific problems with the state of red teaming evaluation at the time: harmful behavior sets were narrow and inconsistent across papers; evaluation conditions such as token budgets, decoding parameters, and target models varied from study to study, making results incomparable; and the automated metrics used to judge attack success were unreliable and easy to game.
HarmBench was designed to address all three of these issues by establishing a comprehensive, standardized framework that the research community could adopt as a shared benchmark.
HarmBench was created by Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. The research team includes members from the Center for AI Safety, the University of Illinois at Urbana-Champaign (Siebel School of Computing and Data Science, Department of Electrical and Computer Engineering, Information Trust Institute, and the National Center for Supercomputing Applications), Carnegie Mellon University, UC Berkeley, and Microsoft.
The paper was first released on arXiv on February 6, 2024 (arXiv:2402.04249), with a revised version published on February 27, 2024. It was accepted to ICML 2024 and published in the Proceedings of Machine Learning Research, Volume 235, pages 35181 to 35224. The conference took place July 21 to 27, 2024, in Vienna, Austria.
HarmBench is built around three desirable properties that the authors identified as essential for a rigorous red teaming evaluation framework:
Evaluations should cover a wide range of harmful behaviors across different semantic categories and functional types. Previous benchmarks often tested only a narrow slice of possible harms, which could lead to misleading conclusions about a model's overall safety. HarmBench addresses this by including 510 behaviors spanning seven semantic categories and four functional types.
A standardized benchmark must ensure that all methods are evaluated under identical conditions so that results can be directly compared. HarmBench achieves this by fixing key evaluation parameters, including the token generation budget (512 tokens), the decoding strategy (greedy decoding), and the hardware configuration. The framework also specifies a clear validation/test split to prevent methods from tuning on test data.
The evaluation classifier must be resistant to gaming. HarmBench develops a fine-tuned classifier that undergoes prequalification tests designed to catch common failure modes, such as models that initially refuse but then comply, random benign text that should not be classified as harmful, and unrelated harmful behaviors that do not match the specific behavior being tested.
HarmBench organizes its 510 behaviors along two dimensions: semantic categories (describing the type of harm) and functional categories (describing the structure of the behavior).
The seven semantic categories cover a broad range of potential harms:
| Semantic Category | Description |
|---|---|
| Cybercrime and Unauthorized Intrusion | Behaviors related to hacking, malware creation, unauthorized system access, and other computer crimes |
| Chemical and Biological Weapons/Drugs | Behaviors involving synthesis of dangerous substances, weaponization of biological agents, or illicit drug manufacturing |
| Copyright Violations | Behaviors that attempt to reproduce copyrighted text, code, or other protected creative works |
| Misinformation and Disinformation | Behaviors aimed at generating false or misleading information, including propaganda and fake news |
| Harassment and Bullying | Behaviors involving targeted harassment, hate speech, threats, or intimidation of individuals or groups |
| Illegal Activities | Behaviors related to fraud, theft, illegal weapons, human trafficking, and other criminal conduct |
| General Harm | Behaviors that do not fit neatly into the other categories but still pose clear risks, including unsafe advice and content promoting self-harm |
The four functional categories define how behaviors are structured and what modality of input they involve:
| Functional Category | Count | Description |
|---|---|---|
| Standard Behaviors | 200 | Self-contained harmful requests modeled after datasets such as AdvBench and the TDC 2023 Red Teaming Track. Each behavior is represented as a single text string. |
| Copyright Behaviors | 100 | Requests that ask the model to reproduce specific copyrighted material. These are evaluated using a hashing-based classifier rather than an LLM judge, since copyright infringement involves verbatim reproduction that can be detected objectively. |
| Contextual Behaviors | 100 | Behaviors that pair a harmful request with a context string. The context provides specific, realistic details that make the harmful request more targeted and differentially dangerous compared to what someone could find with a simple web search. |
| Multimodal Behaviors | 110 | Behaviors that combine an image with a textual instruction referencing the image. These test vision-language models (VLMs) and provide highly specific visual context that would be difficult to replicate through text alone. |
The 510 behaviors are divided into a validation set of 100 behaviors and a test set of 410 behaviors. Researchers are expected to develop and tune their methods on the validation set and report results only on the held-out test set.
HarmBench evaluates 18 red teaming methods that span a range of attack strategies, from gradient-based optimization to black-box prompting techniques. These methods are organized into several categories.
These attacks require direct access to the target model's weights and gradients. They optimize adversarial suffixes or token sequences that, when appended to a harmful prompt, increase the likelihood that the model will comply:
| Attack | Description |
|---|---|
| GCG (Greedy Coordinate Gradient) | Optimizes adversarial suffixes token by token using gradient information. For each position, GCG computes the gradient of the cross-entropy loss with respect to the one-hot encoding of suffix tokens, selects the top-k candidate replacements, samples candidate substitutions uniformly at random, and greedily keeps the substitution that most reduces the loss. Introduced by Zou et al. (2023). |
| GCG-Multi | A variant of GCG that optimizes a single universal suffix across multiple harmful behaviors simultaneously. |
| GCG-Transfer | Uses adversarial suffixes generated by GCG on one model and tests their transferability to other models. |
| PEZ | Projects embeddings to the nearest vocabulary tokens using a continuous relaxation of the discrete token optimization problem. |
| GBDA (Gradient-Based Distributional Attack) | Optimizes a distribution over tokens rather than individual tokens, using the Gumbel-softmax trick to maintain differentiability. |
| UAT (Universal Adversarial Trigger) | Finds universal trigger sequences by performing gradient-based token search that maximizes the probability of a target output. |
| AutoPrompt | Uses gradient-guided search to automatically construct prompts by iteratively replacing trigger tokens with candidates selected based on gradient magnitude. |
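The greedy-coordinate loop at the heart of GCG can be sketched as follows. This is a toy illustration, not HarmBench's implementation: a simple scoring function stands in for both the gradient-derived candidate shortlist and the cross-entropy loss, and all names are assumptions.

```python
import random

def gcg_step(suffix, vocab, loss_fn, top_k=8, n_samples=16):
    """One greedy-coordinate step: propose single-token swaps, keep the best.

    Real GCG shortlists the top-k replacements per position using the
    gradient of the loss w.r.t. one-hot token encodings; here loss_fn
    scores candidates directly (a toy stand-in for illustration).
    """
    best_suffix, best_loss = suffix, loss_fn(suffix)
    candidates = []
    for pos in range(len(suffix)):
        # Stand-in for the gradient-based top-k shortlist at this position.
        for tok in random.sample(vocab, min(top_k, len(vocab))):
            candidates.append(suffix[:pos] + [tok] + suffix[pos + 1:])
    # Sample a batch of candidate substitutions uniformly, as in GCG,
    # and greedily keep the one that reduces the loss the most.
    for cand in random.sample(candidates, min(n_samples, len(candidates))):
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best_suffix, best_loss = cand, cand_loss
    return best_suffix, best_loss

random.seed(0)
# Toy objective: drive every suffix token toward "!".
vocab = list("abcdefghijklmnopqrstuvwxyz!")
loss = lambda s: sum(tok != "!" for tok in s)
suffix = list("abcd")
for _ in range(50):
    suffix, current = gcg_step(suffix, vocab, loss)
```

Because the best candidate is only ever replaced by a lower-loss one, the loss is non-increasing across steps, mirroring the greedy character of the real algorithm.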
These methods use a separate attacker LLM to generate adversarial prompts, requiring only query access to the target model (no gradient information):
| Attack | Description |
|---|---|
| PAIR (Prompt Automatic Iterative Refinement) | Uses an attacker LLM to iteratively refine jailbreak prompts based on the target model's responses. Inspired by social engineering techniques, PAIR typically requires fewer than 20 queries to produce a successful jailbreak. Introduced by Chao et al. (2023). |
| TAP (Tree of Attacks with Pruning) | Extends PAIR by using tree-of-thought reasoning to explore a larger space of adversarial prompts while pruning unlikely candidates before sending them to the target. Introduced by Mehrotra et al. (2023). |
| TAP-Transfer | Uses TAP-generated prompts from one target model and tests them on others. |
| Zero-Shot | Generates adversarial prompts without iterative refinement, typically using a single-turn instruction to an attacker LLM. |
| Stochastic Few-Shot | Provides a few examples of successful jailbreaks to an attacker LLM and samples new adversarial prompts with stochastic variation. |
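The iterative refinement loop behind methods like PAIR can be sketched as below. The `attacker`, `target`, and `judge` callables are hypothetical stand-ins: in the real method they wrap an attacker LLM, the target LLM, and an LLM judge scoring whether the response fulfills the behavior.

```python
def pair_attack(behavior, attacker, target, judge, max_queries=20):
    """Iterative refinement loop in the style of PAIR (a sketch)."""
    prompt = behavior                      # start from the raw request
    history = []
    for query in range(1, max_queries + 1):
        response = target(prompt)
        if judge(behavior, response):      # jailbreak succeeded
            return prompt, query
        history.append((prompt, response))
        # The attacker LLM proposes a refined prompt from the failures so far.
        prompt = attacker(behavior, history)
    return None, max_queries

# Toy stand-ins to exercise the loop: the target "complies" once the
# prompt has been wrapped in enough roleplay framing.
target = lambda p: "OK: " + p if p.count("roleplay:") >= 3 else "I refuse."
judge = lambda b, r: r.startswith("OK:")
attacker = lambda b, hist: "roleplay: " + hist[-1][0]

prompt, queries = pair_attack("describe the behavior", attacker, target, judge)
```

The `max_queries` budget reflects the observation that PAIR typically needs fewer than 20 queries per successful jailbreak.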
The remaining text-based methods rely on evolutionary search, persuasion techniques, manually crafted templates, or no modification at all:

| Attack | Description |
|---|---|
| AutoDAN | Uses a genetic algorithm (evolutionary approach) to evolve adversarial prompts, combining and mutating successful candidates across generations. |
| PAP (Persuasive Adversarial Prompts) | Draws on decades of social science research on persuasion to construct natural-sounding prompts that use persuasion techniques to manipulate the LLM into compliance. Uses a systematic persuasion taxonomy to rewrite harmful queries into more convincing forms. Introduced by Zeng et al. (2024). |
| Human Jailbreaks | A collection of manually crafted jailbreak templates sourced from online communities and prior research. |
| Direct Request | The baseline method that sends the harmful behavior directly to the target model without any adversarial modification. |
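As an illustration of the evolutionary strategy behind AutoDAN, here is a toy genetic loop with selection, single-point crossover, and point mutation. In the real attack the candidates are jailbreak prompts and the fitness is derived from the target model's response; both are toy stand-ins here, and all names are assumptions.

```python
import random

def evolve(population, fitness, generations=30, mutation_rate=0.1,
           alphabet="abcdefgh"):
    """Toy genetic loop in the spirit of AutoDAN."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: len(population) // 2]   # truncation selection
        children = []
        while len(children) < len(population) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))          # single-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(len(child)):                # point mutation
                if random.random() < mutation_rate:
                    child[i] = random.choice(alphabet)
            children.append("".join(child))
        population = parents + children                # parents survive (elitism)
    return max(population, key=fitness)

random.seed(0)
# Toy objective: maximize the number of "a" characters in a 12-char string.
pop = ["".join(random.choice("abcdefgh") for _ in range(12)) for _ in range(20)]
best = evolve(pop, fitness=lambda s: s.count("a"))
```

Carrying parents over unchanged guarantees the best fitness never decreases between generations.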
For multimodal models, HarmBench additionally includes PGD (Projected Gradient Descent) for image perturbation, Adversarial Patch attacks, and Render Text (embedding text into images).
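A standard PGD sketch for image perturbation under an L-infinity budget looks like the following; the gradient function is supplied by the caller as a stand-in for backpropagation through a vision-language model, and the toy objective below is an assumption for illustration only.

```python
import numpy as np

def pgd_attack(image, grad_fn, epsilon=8 / 255, step=2 / 255, steps=10):
    """Projected gradient descent on an image within an L-infinity ball.

    grad_fn should return the gradient of the attack loss w.r.t. the image;
    for a real VLM this comes from backpropagation through the model.
    """
    adv = image.copy()
    for _ in range(steps):
        adv = adv + step * np.sign(grad_fn(adv))              # ascent step
        adv = np.clip(adv, image - epsilon, image + epsilon)  # project to ball
        adv = np.clip(adv, 0.0, 1.0)                          # stay a valid image
    return adv

# Toy objective: push every pixel upward (gradient of sum(adv) is all ones).
img = np.full((3, 8, 8), 0.5)
adv = pgd_attack(img, grad_fn=lambda x: np.ones_like(x))
```

The projection step is what distinguishes PGD from plain gradient ascent: however many steps are taken, the perturbation can never exceed the epsilon budget.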
The HarmBench evaluation pipeline consists of four sequential steps:
Generate Test Cases: An attack method generates adversarial prompts (test cases) for each behavior in the benchmark. For gradient-based methods, this involves optimizing suffix tokens; for LLM-based methods, this involves iterative prompt refinement.
Merge Test Cases (optional): For methods like GCG-Multi that generate a single universal suffix, this step distributes the shared test case across all behaviors.
Generate Completions: Each test case is fed to the target model, which generates a response using standardized parameters (512 tokens, greedy decoding).
Evaluate Completions: An automated classifier determines whether each completion constitutes a successful attack (i.e., the model produced the requested harmful content).
The primary metric is the Attack Success Rate (ASR), defined as the percentage of test cases for which the target model produces output that the classifier judges as fulfilling the harmful behavior. The full evaluation produces a three-dimensional array of outcomes indexed by behavior, attack, and model; averaging over behaviors yields an ASR for each attack and model pair.
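The pipeline steps and the ASR computation can be sketched as follows. All function names here are illustrative assumptions, not the framework's actual API, and the stand-ins at the bottom merely exercise the loop.

```python
def run_pipeline(behaviors, attacks, models, generate, classify):
    """Sketch of the four-step pipeline (names are assumptions).

    1. each attack produces a test case per behavior;
    2. merging of universal test cases is omitted here for brevity;
    3. each target model generates a completion per test case;
    4. a classifier labels each completion as success or failure.
    Returns the ASR per (attack, model) pair, averaged over behaviors.
    """
    asr = {}
    for attack_name, attack in attacks.items():
        for model_name, model in models.items():
            successes = 0
            for behavior in behaviors:
                test_case = attack(behavior)                 # step 1
                completion = generate(model, test_case)      # step 3
                successes += classify(behavior, completion)  # step 4
            asr[(attack_name, model_name)] = successes / len(behaviors)
    return asr

# Toy stand-ins: the "model" complies only with prefixed requests.
behaviors = ["b1", "b2", "b3", "b4"]
attacks = {"direct": lambda b: b, "prefix": lambda b: "please " + b}
models = {"m": None}
generate = lambda model, tc: "harmful" if tc.startswith("please") else "refusal"
classify = lambda b, c: c == "harmful"

asr = run_pipeline(behaviors, attacks, models, generate, classify)
```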
The framework supports three execution modes: SLURM cluster execution for large-scale experiments, local sequential execution, and local parallel execution using Ray.
A central challenge in red teaming evaluation is accurately determining whether an LLM's output actually fulfills a harmful request. HarmBench addresses this with three purpose-built classifiers:
For copyright behaviors, HarmBench uses a hashing-based classifier instead of an LLM judge. This approach directly checks whether the model's output contains verbatim copyrighted text, providing an objective and deterministic evaluation.
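One plausible shape for such a check is to hash overlapping word n-grams of the copyrighted reference and look for matches in the completion. The n-gram size, normalization, and threshold below are assumptions for illustration, not HarmBench's exact scheme.

```python
import hashlib

def ngram_hashes(text, n=10):
    """Hash every overlapping word n-gram of lowercased text."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def contains_verbatim(reference, completion, n=10, threshold=1):
    """Flag a completion that reproduces n-grams of the reference verbatim."""
    overlap = ngram_hashes(reference, n) & ngram_hashes(completion, n)
    return len(overlap) >= threshold

ref = "it was the best of times it was the worst of times " * 3
hit = contains_verbatim(ref, "prefix text " + ref)
miss = contains_verbatim(ref, "a completely unrelated refusal message")
```

Because the check reduces to set membership over hashes, it is deterministic and cheap, which is the stated advantage over an LLM judge for this behavior category.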
The classifiers undergo a set of prequalification tests to verify robustness. These tests check, for example, that completions which initially refuse but later comply are still flagged, that random benign text is not classified as harmful, and that completions fulfilling an unrelated harmful behavior are not counted as successes for the behavior under test.
HarmBench evaluates 33 models, divided into open-source and closed-source categories along with multimodal variants.
| Model | Parameter Count |
|---|---|
| Llama 2 7B Chat | 7B |
| Llama 2 13B Chat | 13B |
| Llama 2 70B Chat | 70B |
| Vicuna 7B v1.5 | 7B |
| Vicuna 13B v1.5 | 13B |
| Koala 7B | 7B |
| Koala 13B | 13B |
| Orca 2 7B | 7B |
| Orca 2 13B | 13B |
| SOLAR 10.7B Instruct | 10.7B |
| OpenChat 3.5 | 7B |
| Starling 7B | 7B |
| Mistral 7B Instruct v0.2 | 7B |
| Mixtral 8x7B Instruct | 46.7B (MoE) |
| Zephyr 7B | 7B |
| Zephyr 7B + R2D2 | 7B |
| Baichuan 2 7B Chat | 7B |
| Baichuan 2 13B Chat | 13B |
| Qwen 7B Chat | 7B |
| Qwen 14B Chat | 14B |
| Qwen 72B Chat | 72B |
| Model | Provider |
|---|---|
| GPT-3.5 Turbo (0613) | OpenAI |
| GPT-3.5 Turbo (1106) | OpenAI |
| GPT-4 (0613) | OpenAI |
| GPT-4 (1106-preview) | OpenAI |
| Claude Instant 1 | Anthropic |
| Claude 2 | Anthropic |
| Claude 2.1 | Anthropic |
| Gemini Pro | Google |
| Mistral Medium | Mistral AI |
| Model | Type |
|---|---|
| LLaVA v1.5 | Open-source |
| InstructBLIP | Open-source |
| Qwen-VL-Chat | Open-source |
| GPT-4V | Closed-source |
The large-scale evaluation across 18 attack methods and 33 models produced several notable findings.
The most significant result was that no single attack method succeeded against all models, and no single model defended against all attacks. Every attack method showed low ASR on at least one target model, and every model was vulnerable to at least one attack. This finding highlights the importance of evaluating across diverse attack strategies rather than relying on any single method.
Across six model families (Llama 2, Vicuna, Koala, Orca 2, Baichuan 2, and Qwen), four attack methods, and model sizes ranging from 7B to 70B parameters, the authors found that model size alone does not predict robustness. Larger models were not consistently more resistant to adversarial attacks. Instead, the training procedure and alignment methodology used during the model's development proved to be more important determinants of safety.
Different models showed very different vulnerability profiles. Mistral 7B Instruct exhibited some of the highest vulnerability rates across nearly all semantic categories, while models such as GPT-4 and Claude 2.1 demonstrated stronger resistance to most attack types. However, even the most robust models had blind spots against specific attack strategies.
Attacks targeting contextual and multimodal behaviors generally achieved higher success rates than attacks targeting standard text behaviors. ASR on vision-language models reached as high as 80% for multimodal behaviors. This is likely because contextual and multimodal behaviors provide highly specific information that renders the harmful request more concrete and harder for safety filters to catch.
The paper demonstrated that inconsistent evaluation parameters across previous studies had led to unreliable comparisons. Specifically, varying the number of generated tokens from a short generation to the full 512-token budget could change the measured ASR by up to 30 percentage points, depending on the model and attack. Standardizing this parameter alone had a substantial impact on the accuracy and fairness of comparisons.
As a demonstration of how HarmBench can enable the co-development of attacks and defenses, the authors introduced R2D2 (Robust Refusal Dynamic Defense), a new adversarial training method for improving LLM robustness.
R2D2 fine-tunes an LLM on a dynamic pool of adversarial test cases that are continually updated by a strong optimization-based red teaming method. The procedure works as follows:
Adversarial test case generation: GCG is used as the adversary during training because it was found to be the most effective attack against robust models like Llama 2. To manage computational cost, R2D2 uses persistent test cases (carrying over optimized suffixes between training steps) rather than restarting GCG from scratch at each step, drawing on techniques from the fast adversarial training literature.
Away loss: This loss component opposes the GCG objective by pushing the model's output distribution away from complying with adversarial inputs. It discourages the model from generating harmful completions.
Toward loss: This loss component trains the model to produce refusal responses when presented with adversarial inputs. Combined with the away loss, it teaches the model both what not to generate and what to generate instead.
Supervised fine-tuning (SFT) loss: A standard language modeling loss on benign conversational data that preserves the model's general utility and prevents catastrophic forgetting during adversarial training.
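The interplay of the three terms can be illustrated on toy single-token distributions. This is a heavy simplification (the real losses operate over token sequences during fine-tuning) and every number below is illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def r2d2_losses(adv_logits, refusal_id, harmful_id, benign_logits, benign_id):
    """Toy illustration of R2D2's three loss terms.

    toward: cross-entropy pulling the adversarial-input distribution
            toward a refusal token;
    away:   penalize probability mass on the harmful continuation;
    sft:    standard language-modeling loss on benign data.
    """
    logp_adv = log_softmax(adv_logits)
    toward = -logp_adv[refusal_id]
    away = -np.log1p(-np.exp(logp_adv[harmful_id]))  # -log(1 - p(harmful))
    sft = -log_softmax(benign_logits)[benign_id]
    return toward + away + sft

total = r2d2_losses(
    adv_logits=np.array([2.0, 0.0, -1.0]),  # toy logits over a 3-token vocab
    refusal_id=0, harmful_id=1,
    benign_logits=np.array([0.0, 3.0, 0.0]), benign_id=1,
)
```

Minimizing the sum simultaneously raises the refusal probability, lowers the harmful-continuation probability, and anchors the model to its benign training distribution.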
The R2D2 defense achieved the strongest robustness among all evaluated model-level defenses:
| Model | GCG ASR |
|---|---|
| Llama 2 7B Chat | 31.8% |
| Llama 2 13B Chat | 30.2% |
| Zephyr 7B + R2D2 | 5.9% |
Zephyr 7B + R2D2 reduced GCG ASR to roughly one fifth of that of the next most robust baseline (Llama 2 13B Chat). Importantly, R2D2 preserved the model's general conversational ability: Zephyr 7B + R2D2 achieved an MT-Bench score of 6.0, comparable to Mistral 7B Instruct v0.2, indicating that the robustness gains did not come at the cost of degraded utility.
The R2D2 defense showed a clear limitation in generalization. Its robustness gains were most pronounced against attacks similar to the GCG adversary used during training. For attack methods that operate differently from GCG, such as PAIR, TAP, and Stochastic Few-Shot, the improvement provided by R2D2 was less significant. This finding suggests that achieving broad robustness may require incorporating multiple diverse attack methods into the adversarial training procedure.
Since its release, HarmBench has become one of the most widely referenced benchmarks for evaluating LLM safety. It has been adopted by both academic researchers and industry practitioners as a standard evaluation protocol for red teaming experiments.
HarmBench has been cited extensively in subsequent AI safety research. Researchers working on new attack methods (such as improved versions of GCG, adaptive attacks, and multi-turn jailbreaks) regularly report results on HarmBench to enable direct comparison with prior work. Similarly, teams developing new defense mechanisms use HarmBench to demonstrate that their methods improve robustness across a standardized set of behaviors.
The HarmBench framework has been integrated into several AI safety evaluation tools. For example, Promptfoo offers a HarmBench plugin that allows developers to evaluate their own models against the HarmBench behavior set without needing to set up the full evaluation pipeline from scratch.
HarmBench's design principles have influenced the development of subsequent benchmarks in the LLM safety space. JailbreakBench, published at NeurIPS 2024, built on HarmBench's approach while adding features such as a live leaderboard and standardized jailbreak artifacts. GT-HarmBench (2025) extended HarmBench's framework by incorporating game-theoretic modeling of attacker-defender interactions.
While HarmBench represents a significant step forward for standardized red teaming evaluation, the authors and subsequent research have identified several limitations: