WMDP benchmark

Updated Jul 16, 2026

The Weapons of Mass Destruction Proxy (WMDP) is a publicly released benchmark and unlearning testbed for large language models, introduced in the paper "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" by Nathaniel Li, Alexander Pan, Anjali Gopal, and over fifty co-authors, posted to arXiv on 5 March 2024 (arXiv:2403.03218) and published at the International Conference on Machine Learning (ICML) 2024.^[1]^[2] The dataset, hosted at cais/wmdp on Hugging Face and at wmdp.ai, is a collection of expert-authored, multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security; the original release contained 4,157 questions, and the project's actively maintained, dual-use-filtered version currently distributes 3,668 items split across wmdp-bio, wmdp-cyber, and wmdp-chem configurations.^[1]^[3]^[4] In the same paper the authors introduced CUT (Contrastive Unlearn Tuning), later simplified and renamed RMU (Representation Misdirection for Unlearning), a fine-tuning method that perturbs model activations on hazardous text while preserving activations on benign text, reducing performance on WMDP roughly to chance while largely retaining capability on the MMLU and MT-Bench evaluations.^[1]^[5]^[6] WMDP was developed by the Center for AI Safety (CAIS) in collaboration with Scale AI and a consortium of more than twenty academic and industry partners, with the explicit policy ambition of giving regulators, frontier labs, and the research community a shared, open instrument for measuring catastrophic-misuse risk.^[7]^[8]^[9]

Background

By late 2023 and early 2024 the question of whether large language models could meaningfully accelerate the development of biological, chemical, or cyber weapons had moved from speculative discussion into formal policy. The White House Executive Order on AI (Executive Order 14110, signed 30 October 2023) explicitly directed agencies to develop evaluations and red-teaming protocols for chemical, biological, radiological, and nuclear (CBRN) risks of dual-use foundation models, and major labs began reporting on private "capability evaluations" for those domains.^[7]^[9]^[10] The WMDP authors framed their work as a response to a specific problem with that landscape: hazardous-capability evaluations existed but were almost exclusively private, held inside government bodies and frontier labs, which prevented the broader research community from independently studying mitigations or comparing methods.^[1]^[7]

WMDP was first announced on 5 March 2024 in a joint release from CAIS and Scale AI, with a coordinated launch on the project page wmdp.ai, the Hugging Face dataset cais/wmdp, the GitHub repository centerforaisafety/wmdp, and the arXiv preprint.^[1]^[7]^[8] The launch was covered the next day in Time magazine, which reported that Scale AI and CAIS had built a public benchmark of "more than 4,000 multiple-choice questions" that probe whether a model has internalised hazardous knowledge, together with an unlearning method intended to scrub that knowledge from a model rather than merely refusing to answer.^[11] The Time article quoted Dan Hendrycks, executive director of CAIS, expressing the hope that the benchmark would be adopted as one of the primary public references for open-source developers reporting safety properties of their releases.^[11]^[12]

The name "Weapons of Mass Destruction Proxy" is deliberate: the dataset is a proxy for catastrophic-misuse capability rather than a direct test of it. The authors describe the benchmark as built around "precursors, neighbors, and emulations of hazardous information" so that the public release does not itself constitute an uplift hazard, and they note that the development process was guided by counsel on U.S. export-control regimes including the International Traffic in Arms Regulations (ITAR).^[7]^[9]

Infobox

Attribute	Value
Full name	Weapons of Mass Destruction Proxy
First release	5 March 2024
arXiv ID	2403.03218
Venue	ICML 2024
Lead authors	Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti (and ~50 others)
Senior authors	Mantas Mazeika, Andy Zou, Dan Hendrycks
Organizations	Center for AI Safety, Scale AI, SecureBio, academic and industry consortium
Initial question count	4,157
Current question count	3,668 (after April 2024 data-quality cleanup)
Domains	Biosecurity, cybersecurity, chemical security
Format	Four-option multiple choice
Hugging Face slug	`cais/wmdp`
License	MIT
Project site	`wmdp.ai`
Companion method	CUT / RMU (Representation Misdirection for Unlearning)

Question generation methodology

WMDP's questions were written by domain experts rather than scraped or generated by language models. The authors recruited a consortium of academics and technical consultants in biosecurity, cybersecurity, and chemistry, and required that every candidate item be reviewed by at least two specialists from different organizations before inclusion.^[7]^[9] The CAIS blog states that authors and reviewers were drawn from "over twenty academic institutions, technical consultants, and industry partners."^[7]

A key design constraint distinguishes WMDP from prior multiple-choice tests of scientific knowledge such as MMLU: questions are filtered to remove items that a competent layperson could answer without specialised, dual-use knowledge. This is intended to keep the benchmark measuring a quantity correlated with genuine uplift potential, rather than measuring general scientific literacy. The CAIS announcement describes the questions as covering "precursors, neighbors, and emulations of hazardous information" so that the public artifact is informative about a model's grasp of the surrounding technical context without itself revealing operational detail.^[7]^[9]

A second filter screens out questions that the team judged too dangerous to publish; the authors report that "especially dangerous questions" are withheld and that development followed U.S. export-control guidance, including ITAR review.^[7] A third filter is post-hoc and ongoing: the dataset's Hugging Face card records that on 8 March 2024 the team adjusted wmdp-cyber for choice-randomization issues, and that on 23 April 2024 it removed wmdp-cyber questions deemed excessively long and wmdp-bio questions with insufficient dual-use potential, taking the public total from 4,157 down to 3,668.^[3]^[4]

The current public split is approximately:^[3]^[4]^[6]

Subset	Approximate questions	Topic
`wmdp-bio`	~1,273	Viral engineering, pathogen biology, dual-use synthetic biology
`wmdp-cyber`	~1,987	Offensive cyber capabilities, exploitation, attack chains
`wmdp-chem`	~408	Hazardous chemical synthesis, deployment, and precursors

Each item is a four-option multiple choice with fields question, choices (length-4 sequence of strings), and answer (an integer 0-3 indexing the correct option), as documented in the Hugging Face dataset card.^[4] The README also carries an explicit warning that the benchmark data should never be allowed into training corpora, because that would render the benchmark uninformative.^[4]
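The item schema described above can be illustrated with a short Python sketch. The field names question, choices, and answer come from the dataset card as summarised here; the WMDPItem class, the prompt layout, and the helper names are assumptions made for illustration, and the example item is a harmless placeholder rather than real benchmark content.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"  # display labels for the four options (an illustrative convention)


@dataclass(frozen=True)
class WMDPItem:
    """Hypothetical container mirroring the three documented fields."""
    question: str
    choices: Sequence[str]
    answer: int  # index 0-3 of the correct option

    def __post_init__(self) -> None:
        # Enforce the four-option, zero-indexed layout described in the text.
        if len(self.choices) != 4:
            raise ValueError("expected exactly four choices")
        if not 0 <= self.answer <= 3:
            raise ValueError("answer must be an index in 0..3")


def format_prompt(item: WMDPItem) -> str:
    """Render an item as a lettered multiple-choice prompt (assumed layout)."""
    lines = [item.question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(item: WMDPItem, predicted_letter: str) -> bool:
    """Compare a predicted letter against the integer answer index."""
    letter = predicted_letter.strip().upper()
    if len(letter) != 1 or letter not in LETTERS:
        return False  # anything that is not a single valid letter counts as wrong
    return LETTERS.index(letter) == item.answer


# Placeholder item; not taken from the benchmark.
example = WMDPItem(
    question="Placeholder question text?",
    choices=["option one", "option two", "option three", "option four"],
    answer=2,
)
print(format_prompt(example))
print(is_correct(example, "C"))
```

Keeping the answer as an integer index and mapping letters to it only at scoring time means the same item can be rendered under different prompt templates without touching the stored data.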

How CUT and RMU work

The same paper that introduces the WMDP benchmark also introduces an unlearning method. In the earlier version of the paper (and in early press coverage) the method is called CUT (Contrastive Unlearn Tuning, also rendered "Controlling Unlearned Representations" in some materials); in the final ICML version and the released code the method is simplified and renamed RMU, for Representation Misdirection for Unlearning.^[1]^[5]^[6] The two share the same core idea, and the simplified RMU is what is implemented in centerforaisafety/wmdp and integrated into evaluation harnesses.^[5]^[6]

Forget loss and retain loss

RMU is a fine-tuning method inspired by representation engineering. It is trained against a two-term objective:^[1]^[5]

A forget loss that pushes the residual-stream activations produced by the model on hazardous text toward a random direction while inflating their norm. Concretely, the loss takes the squared L2 distance between the updated model's activation on a forget-corpus example and a target vector formed by scaling a random direction, so that whatever semantic structure the original representation encoded on that token is replaced by an unstructured, off-distribution signal.
A retain loss that takes the squared L2 distance between the updated model's activations on benign text and the original (frozen) model's activations on the same text, holding the model's behavior on retain-data approximately constant.
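
The two terms can be written out as follows. This is a sketch consistent with the description above rather than a transcription of the paper's notation: the symbols h_upd and h_frozen (activations of the updated and frozen models at the target layer), u (the random direction), c (the scale), and the weighting coefficient alpha are names assumed here, as is the per-token averaging.

```latex
\begin{align}
\mathcal{L}_{\text{forget}} &= \mathbb{E}_{x_f \sim D_{\text{forget}}}
  \left[ \frac{1}{|x_f|} \sum_{t \in x_f}
  \left\lVert h_{\text{upd}}(t) - c\,\mathbf{u} \right\rVert_2^2 \right] \\
\mathcal{L}_{\text{retain}} &= \mathbb{E}_{x_r \sim D_{\text{retain}}}
  \left[ \frac{1}{|x_r|} \sum_{t \in x_r}
  \left\lVert h_{\text{upd}}(t) - h_{\text{frozen}}(t) \right\rVert_2^2 \right] \\
\mathcal{L} &= \mathcal{L}_{\text{forget}} + \alpha\,\mathcal{L}_{\text{retain}}
\end{align}
```

Under this reading, a larger c pushes forget-corpus activations further from their original scale, while a larger alpha trades unlearning strength for tighter preservation of behavior on retain data.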

Training interleaves gradient updates on the forget and retain corpora, and (in the released configuration) updates only a small window of layers near the chosen target layer, both for memory efficiency and to localize the intervention.^[1]^[5]
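
The interleaving of forget and retain batches can be sketched with toy activation vectors in plain Python. This is an illustrative outline only: the function names, the seeded random target, and the hand-written activations are assumptions, no gradient step is shown, and a real implementation would take activations from a chosen layer of a neural network inside an autodiff framework.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Vector = List[float]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_target(dim: int, scale: float, seed: int = 0) -> Vector:
    """Random direction normalised to unit length, then multiplied by a scale."""
    rng = random.Random(seed)
    direction = [rng.random() for _ in range(dim)]
    norm = sum(x * x for x in direction) ** 0.5
    return [scale * x / norm for x in direction]


def forget_loss(updated: Sequence[Vector], target: Vector) -> float:
    """Mean squared distance from forget-text activations to the random target."""
    return sum(squared_l2(h, target) for h in updated) / len(updated)


def retain_loss(updated: Sequence[Vector], frozen: Sequence[Vector]) -> float:
    """Mean squared distance between updated and frozen activations on benign text."""
    return sum(squared_l2(h, f) for h, f in zip(updated, frozen)) / len(updated)


def interleaved_losses(
    forget_batches: Sequence[Sequence[Vector]],
    retain_batches: Sequence[Sequence[Vector]],
    frozen_retain_batches: Sequence[Sequence[Vector]],
    target: Vector,
) -> Iterator[Tuple[float, float]]:
    """Pair one forget batch with one retain batch per step and yield both losses.

    In actual training each yielded pair would be followed by a gradient update
    restricted to the small window of trainable layers.
    """
    for f_acts, r_acts, r_frozen in zip(forget_batches, retain_batches, frozen_retain_batches):
        yield forget_loss(f_acts, target), retain_loss(r_acts, r_frozen)


# Toy activations (hypothetical values, three-dimensional for readability).
target = make_target(dim=3, scale=5.0)
forget_batches = [[[0.0, 0.0, 0.0]], [list(target)]]
retain_batches = [[[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.5]]]
frozen_retain_batches = [[[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]]

steps = list(interleaved_losses(forget_batches, retain_batches, frozen_retain_batches, target))
for i, (f_loss, r_loss) in enumerate(steps):
    print(f"step {i}: forget={f_loss:.3f} retain={r_loss:.3f}")
```

The forget term falls to zero only when activations on forget text coincide with the scaled random target, and the retain term is zero only when activations on benign text are unchanged, which is the tension the two-term objective balances.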

A practically important property highlighted in the paper is that RMU "does not assume access to the hazardous information that it intends to remove." Because the forget loss only needs a corpus of topically related text, not the answers to the WMDP questions themselves, developers can run RMU without holding sensitive material on their machines, which the CAIS newsletter describes as one of the method's main attractions for industrial use.^[9]

Forget and retain corpora

The released cais/wmdp-corpora companion provides the text used for unlearning. The biosecurity forget corpus is a curated set of PubMed papers related to the topics that generate wmdp-bio items; the cyber forget corpus is built from a crawl of GitHub documents relevant to wmdp-cyber topics; and the retain corpus is general text such as Wikitext. The bio forget corpus must be requested by form, while the cyber forget corpus is available behind a password-gated archive.^[13]^[14]

Reported results

On Zephyr-7B-Beta, the paper reports that RMU reduces accuracy on wmdp-bio from 63.7% to 31.2% and on wmdp-cyber from 44.0% to 28.2%, while MMLU drops only from 58.1% to 57.1% and MT-Bench from 7.33 to 7.10.^[1]^[5] On larger models the same recipe was applied to Yi-34B-Chat and Mixtral-8x7B-Instruct, with pre-trained unlearned checkpoints released alongside the code.^[6] The CAIS blog summarises the result as CUT/RMU "reducing model performance on WMDP questions to random chance, while leaving accuracy nearly untouched on a standard battery of general knowledge tests."^[7]
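
The Zephyr-7B-Beta figures quoted above can be put next to the 25% chance level implied by the four-option format. The short Python sketch below only re-does that arithmetic on the numbers already cited; the dictionary layout and helper names are illustrative, and MT-Bench is left out because it is scored on a different scale.

```python
CHANCE = 100.0 / 4  # expected accuracy (%) from uniform guessing over four options

# (before, after) accuracies in percent, as quoted in the text for Zephyr-7B-Beta.
reported = {
    "wmdp-bio": (63.7, 31.2),
    "wmdp-cyber": (44.0, 28.2),
    "MMLU": (58.1, 57.1),
}


def drop(before: float, after: float) -> float:
    """Accuracy lost to unlearning, in percentage points."""
    return round(before - after, 1)


def above_chance(score: float) -> float:
    """Margin by which a score still exceeds four-option chance, in points."""
    return round(score - CHANCE, 1)


for name, (before, after) in reported.items():
    print(
        f"{name}: {before} -> {after} "
        f"(drop {drop(before, after)} points, {above_chance(after)} above chance)"
    )
```

On these numbers the unlearned model sits a few points above the 25% floor on both WMDP subsets, which is the sense in which the results are described as close to chance.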

The paper also reports that linear probes trained to recover the forgotten content from the unlearned model's internal states fail to do so, and that adversarial-suffix attacks using GCG do not recover the knowledge even after thousands of optimization steps.^[5] These are intended as evidence that RMU is doing more than refusal-style suppression, although later work has disputed that claim (see Limitations).

Implementations and adoption

WMDP has been integrated into several standard evaluation stacks since its release.

EleutherAI lm-evaluation-harness. The Hugging Face-hosted dataset is wrapped by tasks wmdp_bio, wmdp_cyber, wmdp_chem, and a grouped wmdp task in EleutherAI's lm-evaluation-harness; the harness's README describes the benchmark in the same terms as the paper and documents the per-subset question counts.^[15]
UK AISI Inspect Evals. The UK AI Safety Institute (now the AI Security Institute) ships WMDP as a built-in evaluation in its open-source Inspect framework, under inspect_evals/wmdp, with subtasks for biosecurity, cybersecurity, and chemical security. The framework scores models by accuracy across the multiple-choice items.^[16]
Reference unlearned checkpoints. The CAIS GitHub repository ships pre-unlearned versions of Zephyr-7B, Yi-34B-Chat, and Mixtral-8x7B-Instruct via Hugging Face, allowing third parties to evaluate the quality of the released RMU artifacts directly.^[6]
Public leaderboards. Third-party leaderboards such as llm-stats.com/benchmarks/wmdp aggregate self-reported and harness-evaluated scores from frontier models; as of mid-2026 the highest-ranked entry is Grok-4.1 Thinking at 0.840 on the combined wmdp task.^[17]
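
Since these tools report accuracy per subset and, in some cases, a single combined figure, it helps to see how a combined score can depend on the aggregation rule. The sketch below is a minimal Python illustration with made-up tallies; it does not reproduce how any particular harness or leaderboard aggregates, and the function names and counts are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of predicted option indices that match the answer indices."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def micro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Pool all questions: total correct divided by total questions."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total


def macro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Average the per-subset accuracies, weighting each subset equally."""
    return sum(c / n for c, n in counts.values()) / len(counts)


# Hypothetical (correct, total) tallies per subset, chosen only to show the gap.
counts = {
    "wmdp-bio": (3, 4),
    "wmdp-cyber": (2, 8),
    "wmdp-chem": (1, 2),
}

print(f"micro-average: {micro_average(counts):.3f}")
print(f"macro-average: {macro_average(counts):.3f}")
```

Because the three subsets differ considerably in size, a pooled score is dominated by the largest subset, so combined figures from different sources are only comparable when the aggregation rule is known.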

Significance

WMDP fills a narrow but consequential gap in the public benchmark ecosystem. Before its release, public multiple-choice tests of scientific competence (such as MMLU and MMLU-Pro) lumped together benign and dual-use knowledge, and public safety benchmarks (such as HarmBench and JailbreakBench) focused on whether a model would output harmful content given an adversarial prompt rather than whether it had the underlying knowledge. WMDP is built specifically to factor out compliance behavior: it measures whether the knowledge is in the model at all, regardless of whether the model is currently willing to articulate it.^[1]^[7]^[9]

A second contribution is the explicit framing of WMDP as both an evaluation and a target for an intervention. The paper's stated thesis is that unlearning, rather than refusal training or post-hoc filters, may be a more durable path toward reducing malicious-use risk; WMDP provides a metric against which different unlearning methods can be compared on equal footing.^[1]^[5] This framing has been picked up directly by policy commentary: the CAIS AI Safety Newsletter and the Time coverage both describe the benchmark and method together as a "concrete path toward reducing malicious use from LLMs."^[9]^[11]

Adoption by the UK AI Security Institute (via Inspect Evals) and by EleutherAI's lm-evaluation-harness has made WMDP one of a small number of public evaluations that are routinely reported in third-party assessments of frontier models, alongside HarmBench and similar benchmarks.^[15]^[16]

Limitations and criticisms

The authors and subsequent commentators have identified several limitations.

Proxy fidelity. WMDP is a proxy benchmark by construction: the public release excludes the most operationally sensitive questions, and questions focus on adjacent and precursor knowledge rather than the exact information that would constitute uplift. The paper argues that performance on WMDP correlates with performance on a held-out set of more sensitive questions, but the proxy gap remains a known source of measurement noise.^[1]^[5]

Vulnerability to relearning. A recurring point in both the original paper and later analyses is that RMU's effects can be reversed. Once a model with unlearned weights is released, a downstream party with fine-tuning access can in many cases restore the original capability by fine-tuning on generic data; the CAIS newsletter notes explicitly that this limitation means CUT "does not mitigate risks from open-source models" and is most useful as a step applied to closed-source systems before deployment.^[9] Subsequent analyses on the AI Alignment Forum and in follow-up papers (e.g., "Unlearning via RMU is mostly shallow," and later work on representation-misdirection robustness) argue that RMU's intervention is concentrated in shallow features and can be largely recovered by simple rephrasing of the prompt or by light fine-tuning, suggesting the method functions more like a filter than true knowledge erasure.^[18]

Sandbagging. Because WMDP is a public, automatically scored multiple-choice test, models can in principle be trained or prompted to underperform specifically on WMDP while retaining hazardous capability elsewhere, an evasion known as "sandbagging" or "password-locking." This vulnerability has been documented in later evaluation-methodology work and is an active area of research in the regulator-oriented evaluation literature.^[19]

Benchmark saturation. As frontier models have improved, WMDP-Bio scores in particular have climbed well above the expert baselines reported in the paper, prompting discussion of whether the benchmark is approaching saturation and whether new, harder, expert-baselined items are needed.^[19]

Question quality. The March and April 2024 cleanups, which took the public release from 4,157 to 3,668 items (489 fewer), illustrate that not every question was clean dual-use signal; some wmdp-bio items were judged to lack adequate dual-use potential and were removed, and some wmdp-cyber items were removed for excessive length.^[4]

Related work and comparison

WMDP sits alongside several other public benchmarks and methods that touch on dangerous-capability evaluation and on knowledge editing or removal in large language models:

Benchmark / method	Focus	Relation to WMDP
MMLU	General scientific knowledge	Used by WMDP as the retain-side check that unlearning has not damaged general capability.^[1]
HarmBench	Behavioral red-teaming, harmful-output elicitation	Measures refusal/jailbreak behavior; WMDP measures underlying knowledge irrespective of compliance.^[7]
JailbreakBench	Standardized jailbreak attacks	Complementary; addresses behavioral elicitation rather than knowledge presence.
MACHIAVELLI	Power-seeking and ethical agency in text games	Earlier CAIS-led safety benchmark; methodologically distinct but shares senior authorship.
Representation engineering	Intervention on internal activations	The conceptual basis from which RMU's forget loss is derived.^[1]^[5]

References

1. Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", arXiv preprint, 2024-03-05. https://arxiv.org/abs/2403.03218. Accessed 2026-05-21.
2. Nathaniel Li et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning", Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. https://proceedings.mlr.press/v235/li24bc.html. Accessed 2026-05-21.
3. WMDP Project, "WMDP Benchmark", wmdp.ai project page, 2024-03-05. https://www.wmdp.ai/. Accessed 2026-05-21.
4. Center for AI Safety, "cais/wmdp dataset card", Hugging Face Datasets, 2024-03-05 (with data-quality updates 2024-03-08 and 2024-04-23). https://huggingface.co/datasets/cais/wmdp. Accessed 2026-05-21.
5. Nathaniel Li et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", arXiv:2403.03218 HTML rendering of v7, 2024-05-15. https://arxiv.org/html/2403.03218v7. Accessed 2026-05-21.
6. Center for AI Safety, "centerforaisafety/wmdp", GitHub repository, 2024. https://github.com/centerforaisafety/wmdp. Accessed 2026-05-21.
7. Center for AI Safety, "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", CAIS blog, 2024-03-05. https://safe.ai/blog/wmdp-benchmark. Accessed 2026-05-21.
8. Scale AI, "New Safety Benchmark for Large Language Models", Scale AI blog, 2024-03-05. https://scale.com/blog/measuring-mitigating-risk-wmdp. Accessed 2026-05-21.
9. Center for AI Safety, "AI Safety Newsletter #32: Measuring and Reducing Hazardous Knowledge in LLMs", safe.ai newsletter, 2024-03. https://newsletter.safe.ai/p/ai-safety-newsletter-32-measuring. Accessed 2026-05-21.
10. The White House, "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence" (EO 14110), 2023-10-30. https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/. Accessed 2026-05-21.
11. Time Magazine, "Researchers Unveil Method to Purge AI of Dangerous Knowledge", time.com article on WMDP and CUT, 2024-03. https://time.com/6878893/ai-artificial-intelligence-dangerous-knowledge/. Accessed 2026-05-21.
12. Sejal Sharma, "This 'weapon' can wipe the AI slate clean", Interesting Engineering, 2024-03-09. https://interestingengineering.com/innovation/this-weapon-can-wipe-the-ai-slate-clean. Accessed 2026-05-21.
13. Center for AI Safety, "cais/wmdp-corpora dataset card", Hugging Face Datasets, 2024. https://huggingface.co/datasets/cais/wmdp-corpora. Accessed 2026-05-21.
14. Center for AI Safety, "cais/wmdp-bio-forget-corpus", Hugging Face Datasets, 2024. https://huggingface.co/datasets/cais/wmdp-bio-forget-corpus. Accessed 2026-05-21.
15. EleutherAI, "lm-evaluation-harness: WMDP task README", GitHub, 2024. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wmdp/README.md. Accessed 2026-05-21.
16. UK AI Safety Institute, "WMDP: Measuring and Reducing Malicious Use With Unlearning", Inspect Evals documentation, 2024. https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/wmdp/. Accessed 2026-05-21.
17. LLM-Stats, "WMDP Benchmark Leaderboard", llm-stats.com, 2026. https://llm-stats.com/benchmarks/wmdp. Accessed 2026-05-21.
18. Eugene van der Watt, "WMDP measures and reduces LLM malicious use with unlearning", DailyAI, 2024-03-12. https://dailyai.com/2024/03/wmdp-measures-and-reduces-llm-malicious-use-with-unlearning/. Accessed 2026-05-21.
19. Emergent Mind, "WMDP Benchmark for LLM Hazardous Knowledge", emergentmind.com topic page, 2024-2026. https://www.emergentmind.com/topics/wmdp-benchmark. Accessed 2026-05-21.
