HarmBench is a standardized evaluation framework for automated red teaming and robust refusal of large language models (LLMs). Developed by researchers at the Center for AI Safety (CAIS) and the University of Illinois at Urbana-Champaign, HarmBench provides a systematic approach to evaluating both attack methods that attempt to elicit harmful outputs from LLMs and the defensive measures designed to prevent them. The framework includes 510 carefully curated harmful behaviors, 18 red teaming attack methods, and a standardized evaluation pipeline that enables fair comparison across methods. HarmBench was presented at the 41st International Conference on Machine Learning (ICML) in July 2024 and is openly available under the MIT license.
Automated red teaming has become an important area of AI safety research as large language models have grown more capable and widely deployed. The goal of red teaming is to identify vulnerabilities in LLMs by generating inputs that cause models to produce harmful, dangerous, or policy-violating outputs. Before HarmBench, the field lacked a unified framework for assessing red teaming methods. Individual papers introduced their own sets of harmful behaviors, used different evaluation metrics, and tested on varying subsets of models under different conditions. This made it difficult, if not impossible, to compare the effectiveness of different attack strategies or to measure real progress in defensive robustness.
The authors of HarmBench identified three specific problems with the state of red teaming evaluation at the time: harmful behavior sets were narrow and inconsistent across papers; evaluation conditions such as token budgets, decoding parameters, and target models varied from study to study, making results incomparable; and the automated metrics used to judge attack success were unreliable and easy to game.
HarmBench was designed to address all three of these issues by establishing a comprehensive, standardized framework that the research community could adopt as a shared benchmark.
HarmBench was created by Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. The research team includes members from the Center for AI Safety, the University of Illinois at Urbana-Champaign (Siebel School of Computing and Data Science, Department of Electrical and Computer Engineering, Information Trust Institute, and the National Center for Supercomputing Applications), Carnegie Mellon University, UC Berkeley, and Microsoft.
The paper was first released on arXiv on February 6, 2024 (arXiv:2402.04249), with a revised version published on February 27, 2024. It was accepted to ICML 2024 and published in the Proceedings of Machine Learning Research, Volume 235, pages 35181 to 35224. The conference took place July 21 to 27, 2024, in Vienna, Austria.
HarmBench is built around three desirable properties that the authors identified as essential for a rigorous red teaming evaluation framework:
Evaluations should cover a wide range of harmful behaviors across different semantic categories and functional types. Previous benchmarks often tested only a narrow slice of possible harms, which could lead to misleading conclusions about a model's overall safety. HarmBench addresses this by including 510 behaviors spanning seven semantic categories and four functional types.
A standardized benchmark must ensure that all methods are evaluated under identical conditions so that results can be directly compared. HarmBench achieves this by fixing key evaluation parameters, including the token generation budget (512 tokens), the decoding strategy (greedy decoding), and the hardware configuration. The framework also specifies a clear validation/test split to prevent methods from tuning on test data.
The evaluation classifier must be resistant to gaming. HarmBench develops a fine-tuned classifier that undergoes prequalification tests designed to catch common failure modes, such as models that initially refuse but then comply, random benign text that should not be classified as harmful, and unrelated harmful behaviors that do not match the specific behavior being tested.
HarmBench organizes its 510 behaviors along two dimensions: semantic categories (describing the type of harm) and functional categories (describing the structure of the behavior).
The seven semantic categories cover a broad range of potential harms:
| Semantic Category | Description |
|---|---|
| Cybercrime and Unauthorized Intrusion | Behaviors related to hacking, malware creation, unauthorized system access, and other computer crimes |
| Chemical and Biological Weapons/Drugs | Behaviors involving synthesis of dangerous substances, weaponization of biological agents, or illicit drug manufacturing |
| Copyright Violations | Behaviors that attempt to reproduce copyrighted text, code, or other protected creative works |
| Misinformation and Disinformation | Behaviors aimed at generating false or misleading information, including propaganda and fake news |
| Harassment and Bullying | Behaviors involving targeted harassment, hate speech, threats, or intimidation of individuals or groups |
| Illegal Activities | Behaviors related to fraud, theft, illegal weapons, human trafficking, and other criminal conduct |
| General Harm | Behaviors that do not fit neatly into the other categories but still pose clear risks, including unsafe advice and content promoting self-harm |
The four functional categories define how behaviors are structured and what modality of input they involve:
| Functional Category | Count | Description |
|---|---|---|
| Standard Behaviors | 200 | Self-contained harmful requests modeled after datasets such as AdvBench and the TDC 2023 Red Teaming Track. Each behavior is represented as a single text string. |
| Copyright Behaviors | 100 | Requests that ask the model to reproduce specific copyrighted material. These are evaluated using a hashing-based classifier rather than an LLM judge, since copyright infringement involves verbatim reproduction that can be detected objectively. |
| Contextual Behaviors | 100 | Behaviors that pair a harmful request with a context string. The context provides specific, realistic details that make the harmful request more targeted and differentially dangerous compared to what someone could find with a simple web search. |
| Multimodal Behaviors | 110 | Behaviors that combine an image with a textual instruction referencing the image. These test vision-language models (VLMs) and provide highly specific visual context that would be difficult to replicate through text alone. |
The 510 behaviors are divided into a validation set of 100 behaviors and a test set of 410 behaviors. Researchers are expected to develop and tune their methods on the validation set and report results only on the held-out test set.
HarmBench evaluates 18 red teaming methods that span a range of attack strategies, from gradient-based optimization to black-box prompting techniques. These methods are organized into several categories.
These attacks require direct access to the target model's weights and gradients. They optimize adversarial suffixes or token sequences that, when appended to a harmful prompt, increase the likelihood that the model will comply:
| Attack | Description |
|---|---|
| GCG (Greedy Coordinate Gradient) | Optimizes adversarial suffixes token by token using gradient information. For each position, GCG computes the gradient of the cross-entropy loss with respect to the one-hot encoding of suffix tokens, selects the top-k candidate replacements, samples candidate substitutions uniformly at random, and greedily keeps the substitution that most reduces the loss. Introduced by Zou et al. (2023). |
| GCG-Multi | A variant of GCG that optimizes a single universal suffix across multiple harmful behaviors simultaneously. |
| GCG-Transfer | Uses adversarial suffixes generated by GCG on one model and tests their transferability to other models. |
| PEZ | Projects embeddings to the nearest vocabulary tokens using a continuous relaxation of the discrete token optimization problem. |
| GBDA (Gradient-Based Distributional Attack) | Optimizes a distribution over tokens rather than individual tokens, using the Gumbel-softmax trick to maintain differentiability. |
| UAT (Universal Adversarial Trigger) | Finds universal trigger sequences by performing gradient-based token search that maximizes the probability of a target output. |
| AutoPrompt | Uses gradient-guided search to automatically construct prompts by iteratively replacing trigger tokens with candidates selected based on gradient magnitude. |
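The greedy-coordinate loop at the heart of GCG can be sketched as follows. This is a toy illustration, not HarmBench's implementation: a simple scoring function stands in for both the gradient-derived candidate shortlist and the cross-entropy loss, and all names are assumptions.

```python
import random

def gcg_step(suffix, vocab, loss_fn, top_k=8, n_samples=16):
    """One greedy-coordinate step: propose single-token swaps, keep the best.

    Real GCG shortlists the top-k replacements per position using the
    gradient of the loss w.r.t. one-hot token encodings; here loss_fn
    scores candidates directly (a toy stand-in for illustration).
    """
    best_suffix, best_loss = suffix, loss_fn(suffix)
    candidates = []
    for pos in range(len(suffix)):
        # Stand-in for the gradient-based top-k shortlist at this position.
        for tok in random.sample(vocab, min(top_k, len(vocab))):
            candidates.append(suffix[:pos] + [tok] + suffix[pos + 1:])
    # Sample a batch of candidate substitutions uniformly, as in GCG,
    # and greedily keep the one that reduces the loss the most.
    for cand in random.sample(candidates, min(n_samples, len(candidates))):
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best_suffix, best_loss = cand, cand_loss
    return best_suffix, best_loss

random.seed(0)
# Toy objective: drive every suffix token toward "!".
vocab = list("abcdefghijklmnopqrstuvwxyz!")
loss = lambda s: sum(tok != "!" for tok in s)
suffix = list("abcd")
for _ in range(50):
    suffix, current = gcg_step(suffix, vocab, loss)
```

Because the best candidate is only ever replaced by a lower-loss one, the loss is non-increasing across steps, mirroring the greedy character of the real algorithm.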
These methods use a separate attacker LLM to generate adversarial prompts, requiring only query access to the target model (no gradient information):
| Attack | Description |
|---|---|
| PAIR (Prompt Automatic Iterative Refinement) | Uses an attacker LLM to iteratively refine jailbreak prompts based on the target model's responses. Inspired by social engineering techniques, PAIR typically requires fewer than 20 queries to produce a successful jailbreak. Introduced by Chao et al. (2023). |
| TAP (Tree of Attacks with Pruning) | Extends PAIR by using tree-of-thought reasoning to explore a larger space of adversarial prompts while pruning unlikely candidates before sending them to the target. Introduced by Mehrotra et al. (2023). |
| TAP-Transfer | Uses TAP-generated prompts from one target model and tests them on others. |
| Zero-Shot | Generates adversarial prompts without iterative refinement, typically using a single-turn instruction to an attacker LLM. |
| Stochastic Few-Shot | Provides a few examples of successful jailbreaks to an attacker LLM and samples new adversarial prompts with stochastic variation. |
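The iterative refinement loop behind methods like PAIR can be sketched as below. The `attacker`, `target`, and `judge` callables are hypothetical stand-ins: in the real method they wrap an attacker LLM, the target LLM, and an LLM judge scoring whether the response fulfills the behavior.

```python
def pair_attack(behavior, attacker, target, judge, max_queries=20):
    """Iterative refinement loop in the style of PAIR (a sketch)."""
    prompt = behavior                      # start from the raw request
    history = []
    for query in range(1, max_queries + 1):
        response = target(prompt)
        if judge(behavior, response):      # jailbreak succeeded
            return prompt, query
        history.append((prompt, response))
        # The attacker LLM proposes a refined prompt from the failures so far.
        prompt = attacker(behavior, history)
    return None, max_queries

# Toy stand-ins to exercise the loop: the target "complies" once the
# prompt has been wrapped in enough roleplay framing.
target = lambda p: "OK: " + p if p.count("roleplay:") >= 3 else "I refuse."
judge = lambda b, r: r.startswith("OK:")
attacker = lambda b, hist: "roleplay: " + hist[-1][0]

prompt, queries = pair_attack("describe the behavior", attacker, target, judge)
```

The `max_queries` budget reflects the observation that PAIR typically needs fewer than 20 queries per successful jailbreak.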
The remaining text-based methods rely on evolutionary search, persuasion techniques, manually crafted templates, or no modification at all:

| Attack | Description |
|---|---|
| AutoDAN | Uses a genetic algorithm (evolutionary approach) to evolve adversarial prompts, combining and mutating successful candidates across generations. |
| PAP (Persuasive Adversarial Prompts) | Draws on decades of social science research on persuasion to construct natural-sounding prompts that use persuasion techniques to manipulate the LLM into compliance. Uses a systematic persuasion taxonomy to rewrite harmful queries into more convincing forms. Introduced by Zeng et al. (2024). |
| Human Jailbreaks | A collection of manually crafted jailbreak templates sourced from online communities and prior research. |
| Direct Request | The baseline method that sends the harmful behavior directly to the target model without any adversarial modification. |
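As an illustration of the evolutionary strategy behind AutoDAN, here is a toy genetic loop with selection, single-point crossover, and point mutation. In the real attack the candidates are jailbreak prompts and the fitness is derived from the target model's response; both are toy stand-ins here, and all names are assumptions.

```python
import random

def evolve(population, fitness, generations=30, mutation_rate=0.1,
           alphabet="abcdefgh"):
    """Toy genetic loop in the spirit of AutoDAN."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: len(population) // 2]   # truncation selection
        children = []
        while len(children) < len(population) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))          # single-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(len(child)):                # point mutation
                if random.random() < mutation_rate:
                    child[i] = random.choice(alphabet)
            children.append("".join(child))
        population = parents + children                # parents survive (elitism)
    return max(population, key=fitness)

random.seed(0)
# Toy objective: maximize the number of "a" characters in a 12-char string.
pop = ["".join(random.choice("abcdefgh") for _ in range(12)) for _ in range(20)]
best = evolve(pop, fitness=lambda s: s.count("a"))
```

Carrying parents over unchanged guarantees the best fitness never decreases between generations.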
For multimodal models, HarmBench additionally includes PGD (Projected Gradient Descent) for image perturbation, Adversarial Patch attacks, and Render Text (embedding text into images).
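A standard PGD sketch for image perturbation under an L-infinity budget looks like the following; the gradient function is supplied by the caller as a stand-in for backpropagation through a vision-language model, and the toy objective below is an assumption for illustration only.

```python
import numpy as np

def pgd_attack(image, grad_fn, epsilon=8 / 255, step=2 / 255, steps=10):
    """Projected gradient descent on an image within an L-infinity ball.

    grad_fn should return the gradient of the attack loss w.r.t. the image;
    for a real VLM this comes from backpropagation through the model.
    """
    adv = image.copy()
    for _ in range(steps):
        adv = adv + step * np.sign(grad_fn(adv))              # ascent step
        adv = np.clip(adv, image - epsilon, image + epsilon)  # project to ball
        adv = np.clip(adv, 0.0, 1.0)                          # stay a valid image
    return adv

# Toy objective: push every pixel upward (gradient of sum(adv) is all ones).
img = np.full((3, 8, 8), 0.5)
adv = pgd_attack(img, grad_fn=lambda x: np.ones_like(x))
```

The projection step is what distinguishes PGD from plain gradient ascent: however many steps are taken, the perturbation can never exceed the epsilon budget.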
The HarmBench evaluation pipeline consists of four sequential steps:
Generate Test Cases: An attack method generates adversarial prompts (test cases) for each behavior in the benchmark. For gradient-based methods, this involves optimizing suffix tokens; for LLM-based methods, this involves iterative prompt refinement.
Merge Test Cases (optional): For methods like GCG-Multi that generate a single universal suffix, this step distributes the shared test case across all behaviors.
Generate Completions: Each test case is fed to the target model, which generates a response using standardized parameters (512 tokens, greedy decoding).
Evaluate Completions: An automated classifier determines whether each completion constitutes a successful attack (i.e., the model produced the requested harmful content).
The primary metric is the Attack Success Rate (ASR), defined as the percentage of test cases for which the target model produces output that the classifier judges as fulfilling the harmful behavior. The full evaluation produces a three-dimensional array of outcomes indexed by behavior, attack, and model; averaging over behaviors yields an ASR for each attack and model pair.
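The pipeline steps and the ASR computation can be sketched as follows. All function names here are illustrative assumptions, not the framework's actual API, and the stand-ins at the bottom merely exercise the loop.

```python
def run_pipeline(behaviors, attacks, models, generate, classify):
    """Sketch of the four-step pipeline (names are assumptions).

    1. each attack produces a test case per behavior;
    2. merging of universal test cases is omitted here for brevity;
    3. each target model generates a completion per test case;
    4. a classifier labels each completion as success or failure.
    Returns the ASR per (attack, model) pair, averaged over behaviors.
    """
    asr = {}
    for attack_name, attack in attacks.items():
        for model_name, model in models.items():
            successes = 0
            for behavior in behaviors:
                test_case = attack(behavior)                 # step 1
                completion = generate(model, test_case)      # step 3
                successes += classify(behavior, completion)  # step 4
            asr[(attack_name, model_name)] = successes / len(behaviors)
    return asr

# Toy stand-ins: the "model" complies only with prefixed requests.
behaviors = ["b1", "b2", "b3", "b4"]
attacks = {"direct": lambda b: b, "prefix": lambda b: "please " + b}
models = {"m": None}
generate = lambda model, tc: "harmful" if tc.startswith("please") else "refusal"
classify = lambda b, c: c == "harmful"

asr = run_pipeline(behaviors, attacks, models, generate, classify)
```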
The framework supports three execution modes: SLURM cluster execution for large-scale experiments, local sequential execution, and local parallel execution using Ray.
A central challenge in red teaming evaluation is accurately determining whether an LLM's output actually fulfills a harmful request. HarmBench addresses this with three purpose-built classifiers:
For copyright behaviors, HarmBench uses a hashing-based classifier instead of an LLM judge. This approach directly checks whether the model's output contains verbatim copyrighted text, providing an objective and deterministic evaluation.
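One plausible shape for such a check is to hash overlapping word n-grams of the copyrighted reference and look for matches in the completion. The n-gram size, normalization, and threshold below are assumptions for illustration, not HarmBench's exact scheme.

```python
import hashlib

def ngram_hashes(text, n=10):
    """Hash every overlapping word n-gram of lowercased text."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def contains_verbatim(reference, completion, n=10, threshold=1):
    """Flag a completion that reproduces n-grams of the reference verbatim."""
    overlap = ngram_hashes(reference, n) & ngram_hashes(completion, n)
    return len(overlap) >= threshold

ref = "it was the best of times it was the worst of times " * 3
hit = contains_verbatim(ref, "prefix text " + ref)
miss = contains_verbatim(ref, "a completely unrelated refusal message")
```

Because the check reduces to set membership over hashes, it is deterministic and cheap, which is the stated advantage over an LLM judge for this behavior category.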
The classifiers undergo a set of prequalification tests to verify robustness. These tests check, for example, that completions which initially refuse but later comply are still flagged, that random benign text is not classified as harmful, and that completions fulfilling an unrelated harmful behavior are not counted as successes for the behavior under test.
HarmBench evaluates 33 models, divided into open-source and closed-source categories along with multimodal variants.
| Model | Parameter Count |
|---|---|
| Llama 2 7B Chat | 7B |
| Llama 2 13B Chat | 13B |
| Llama 2 70B Chat | 70B |
| Vicuna 7B v1.5 | 7B |
| Vicuna 13B v1.5 | 13B |
| Koala 7B | 7B |
| Koala 13B | 13B |
| Orca 2 7B | 7B |
| Orca 2 13B | 13B |
| SOLAR 10.7B Instruct | 10.7B |
| OpenChat 3.5 | 7B |
| Starling 7B | 7B |
| Mistral 7B Instruct v0.2 | 7B |
| Mixtral 8x7B Instruct | 46.7B (MoE) |
| Zephyr 7B | 7B |
| Zephyr 7B + R2D2 | 7B |
| Baichuan 2 7B Chat | 7B |
| Baichuan 2 13B Chat | 13B |
| Qwen 7B Chat | 7B |
| Qwen 14B Chat | 14B |
| Qwen 72B Chat | 72B |
| Model | Provider |
|---|---|
| GPT-3.5 Turbo (0613) | OpenAI |
| GPT-3.5 Turbo (1106) | OpenAI |
| GPT-4 (0613) | OpenAI |
| GPT-4 (1106-preview) | OpenAI |
| Claude Instant 1 | Anthropic |
| Claude 2 | Anthropic |
| Claude 2.1 | Anthropic |
| Gemini Pro | Google |
| Mistral Medium | Mistral AI |
| Model | Type |
|---|---|
| LLaVA v1.5 | Open-source |
| InstructBLIP | Open-source |
| Qwen-VL-Chat | Open-source |
| GPT-4V | Closed-source |
The large-scale evaluation across 18 attack methods and 33 models produced several notable findings.
The most significant result was that no single attack method succeeded against all models, and no single model defended against all attacks. Every attack method showed low ASR on at least one target model, and every model was vulnerable to at least one attack. This finding highlights the importance of evaluating across diverse attack strategies rather than relying on any single method.
Across six model families (Llama 2, Vicuna, Koala, Orca 2, Baichuan 2, and Qwen), four attack methods, and model sizes ranging from 7B to 70B parameters, the authors found that model size alone does not predict robustness. Larger models were not consistently more resistant to adversarial attacks. Instead, the training procedure and alignment methodology used during the model's development proved to be more important determinants of safety.
Different models showed very different vulnerability profiles. Mistral 7B Instruct exhibited some of the highest vulnerability rates across nearly all semantic categories, while models such as GPT-4 and Claude 2.1 demonstrated stronger resistance to most attack types. However, even the most robust models had blind spots against specific attack strategies.
Attacks targeting contextual and multimodal behaviors generally achieved higher success rates than attacks targeting standard text behaviors. ASR on vision-language models reached as high as 80% for multimodal behaviors. This is likely because contextual and multimodal behaviors provide highly specific information that renders the harmful request more concrete and harder for safety filters to catch.
The paper demonstrated that inconsistent evaluation parameters across previous studies had led to unreliable comparisons. Specifically, varying the number of generated tokens from a short generation to the full 512-token budget could change the measured ASR by up to 30 percentage points, depending on the model and attack. Standardizing this parameter alone had a substantial impact on the accuracy and fairness of comparisons.
As a demonstration of how HarmBench can enable the co-development of attacks and defenses, the authors introduced R2D2 (Robust Refusal Dynamic Defense), a new adversarial training method for improving LLM robustness.
R2D2 fine-tunes an LLM on a dynamic pool of adversarial test cases that are continually updated by a strong optimization-based red teaming method. The procedure works as follows:
Adversarial test case generation: GCG is used as the adversary during training because it was found to be the most effective attack against robust models like Llama 2. To manage computational cost, R2D2 uses persistent test cases (carrying over optimized suffixes between training steps) rather than restarting GCG from scratch at each step, drawing on techniques from the fast adversarial training literature.
Away loss: This loss component opposes the GCG objective by pushing the model's output distribution away from complying with adversarial inputs. It discourages the model from generating harmful completions.
Toward loss: This loss component trains the model to produce refusal responses when presented with adversarial inputs. Combined with the away loss, it teaches the model both what not to generate and what to generate instead.
Supervised fine-tuning (SFT) loss: A standard language modeling loss on benign conversational data that preserves the model's general utility and prevents catastrophic forgetting during adversarial training.
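The interplay of the three terms can be illustrated on toy single-token distributions. This is a heavy simplification (the real losses operate over token sequences during fine-tuning) and every number below is illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def r2d2_losses(adv_logits, refusal_id, harmful_id, benign_logits, benign_id):
    """Toy illustration of R2D2's three loss terms.

    toward: cross-entropy pulling the adversarial-input distribution
            toward a refusal token;
    away:   penalize probability mass on the harmful continuation;
    sft:    standard language-modeling loss on benign data.
    """
    logp_adv = log_softmax(adv_logits)
    toward = -logp_adv[refusal_id]
    away = -np.log1p(-np.exp(logp_adv[harmful_id]))  # -log(1 - p(harmful))
    sft = -log_softmax(benign_logits)[benign_id]
    return toward + away + sft

total = r2d2_losses(
    adv_logits=np.array([2.0, 0.0, -1.0]),  # toy logits over a 3-token vocab
    refusal_id=0, harmful_id=1,
    benign_logits=np.array([0.0, 3.0, 0.0]), benign_id=1,
)
```

Minimizing the sum simultaneously raises the refusal probability, lowers the harmful-continuation probability, and anchors the model to its benign training distribution.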
The R2D2 defense achieved the strongest robustness among all evaluated model-level defenses:
| Model | GCG ASR |
|---|---|
| Llama 2 7B Chat | 31.8% |
| Llama 2 13B Chat | 30.2% |
| Zephyr 7B + R2D2 | 5.9% |
Zephyr 7B + R2D2 reduced GCG ASR to roughly one fifth of that of the next most robust baseline (Llama 2 13B Chat). Importantly, R2D2 preserved the model's general conversational ability: Zephyr 7B + R2D2 achieved an MT-Bench score of 6.0, comparable to Mistral 7B Instruct v0.2, indicating that the robustness gains did not come at the cost of degraded utility.
The R2D2 defense showed a clear limitation in generalization. Its robustness gains were most pronounced against attacks similar to the GCG adversary used during training. For attack methods that operate differently from GCG, such as PAIR, TAP, and Stochastic Few-Shot, the improvement provided by R2D2 was less significant. This finding suggests that achieving broad robustness may require incorporating multiple diverse attack methods into the adversarial training procedure.
Since its release, HarmBench has become one of the most widely referenced benchmarks for evaluating LLM safety. It has been adopted by both academic researchers and industry practitioners as a standard evaluation protocol for red teaming experiments.
HarmBench has been cited extensively in subsequent AI safety research. Researchers working on new attack methods (such as improved versions of GCG, adaptive attacks, and multi-turn jailbreaks) regularly report results on HarmBench to enable direct comparison with prior work. Similarly, teams developing new defense mechanisms use HarmBench to demonstrate that their methods improve robustness across a standardized set of behaviors.
The HarmBench framework has been integrated into several AI safety evaluation tools. For example, Promptfoo offers a HarmBench plugin that allows developers to evaluate their own models against the HarmBench behavior set without needing to set up the full evaluation pipeline from scratch.
HarmBench's design principles have influenced the development of subsequent benchmarks in the LLM safety space. JailbreakBench, published at NeurIPS 2024, built on HarmBench's approach while adding features such as a live leaderboard and standardized jailbreak artifacts. GT-HarmBench (2025) extended HarmBench's framework by incorporating game-theoretic modeling of attacker-defender interactions.
While HarmBench represents a significant step forward for standardized red teaming evaluation, the authors and subsequent research have identified several limitations: