WMDP benchmark
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,941 words
The Weapons of Mass Destruction Proxy (WMDP) is a publicly released benchmark and unlearning testbed for large language models, introduced in the paper "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" by Nathaniel Li, Alexander Pan, Anjali Gopal, and over fifty co-authors, posted to arXiv on 5 March 2024 (arXiv:2403.03218) and published at the International Conference on Machine Learning (ICML) 2024.[1][2] The dataset, hosted at cais/wmdp on Hugging Face and at wmdp.ai, is a collection of expert-authored, multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security; the original release contained 4,157 questions, and the project's actively maintained, dual-use-filtered version currently distributes 3,668 items split across wmdp-bio, wmdp-cyber, and wmdp-chem configurations.[1][3][4] In the same paper the authors introduced CUT (Contrastive Unlearn Tuning), later simplified and renamed RMU (Representation Misdirection for Unlearning), a fine-tuning method that perturbs model activations on hazardous text while preserving activations on benign text, reducing performance on WMDP roughly to chance while largely retaining capability on the MMLU and MT-Bench evaluations.[1][5][6] WMDP was developed by the Center for AI Safety (CAIS) in collaboration with Scale AI and a consortium of more than twenty academic and industry partners, with the explicit policy ambition of giving regulators, frontier labs, and the research community a shared, open instrument for measuring catastrophic-misuse risk.[7][8][9]
By late 2023 and early 2024 the question of whether large language models could meaningfully accelerate the development of biological, chemical, or cyber weapons had moved from speculative discussion into formal policy. The White House Executive Order on AI (Executive Order 14110, signed 30 October 2023) explicitly directed agencies to develop evaluations and red-teaming protocols for chemical, biological, radiological, and nuclear (CBRN) risks of dual-use foundation models, and major labs began running private "capability evaluations" for those domains.[7][9][10] The WMDP authors framed their work as a response to a specific problem with that landscape: hazardous-capability evaluations existed but were almost exclusively private, held inside government bodies and frontier labs, which prevented the broader research community from independently studying mitigations or comparing methods.[1][7]
WMDP was first announced on 5 March 2024 in a joint release from CAIS and Scale AI, with a coordinated launch on the project page wmdp.ai, the Hugging Face dataset cais/wmdp, the GitHub repository centerforaisafety/wmdp, and the arXiv preprint.[1][7][8] The launch was covered the next day in Time magazine, which reported that Scale AI and CAIS had built a public benchmark of "more than 4,000 multiple-choice questions" that probe whether a model has internalised hazardous knowledge, together with an unlearning method intended to scrub that knowledge from a model rather than merely refusing to answer.[11] The Time article quoted Dan Hendrycks, executive director of CAIS, expressing the hope that the benchmark would be adopted as one of the primary public references for open-source developers reporting safety properties of their releases.[11][12]
The name "Weapons of Mass Destruction Proxy" is deliberate: the dataset is a proxy for catastrophic-misuse capability rather than a direct test of it. The authors describe the benchmark as built around "precursors, neighbors, and emulations of hazardous information" so that the public release does not itself constitute an uplift hazard, and they note that the development process was guided by counsel on U.S. export-control regimes including the International Traffic in Arms Regulations (ITAR).[7][9]
| Attribute | Value |
|---|---|
| Full name | Weapons of Mass Destruction Proxy |
| First release | 5 March 2024 |
| arXiv ID | 2403.03218 |
| Venue | ICML 2024 |
| Lead authors | Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti (and ~50 others) |
| Senior authors | Mantas Mazeika, Andy Zou, Dan Hendrycks |
| Organizations | Center for AI Safety, Scale AI, SecureBio, academic and industry consortium |
| Initial question count | 4,157 |
| Current question count | 3,668 (after April 2024 data-quality cleanup) |
| Domains | Biosecurity, cybersecurity, chemical security |
| Format | Four-option multiple choice |
| Hugging Face slug | cais/wmdp |
| License | MIT |
| Project site | wmdp.ai |
| Companion method | CUT / RMU (Representation Misdirection for Unlearning) |
WMDP's questions were written by domain experts rather than scraped or generated by language models. The authors recruited a consortium of academics and technical consultants in biosecurity, cybersecurity, and chemistry, and required that every candidate item be reviewed by at least two specialists from different organizations before inclusion.[7][9] The CAIS blog states that authors and reviewers were drawn from "over twenty academic institutions, technical consultants, and industry partners."[7]
A key design constraint distinguishes WMDP from prior multiple-choice tests of scientific knowledge such as MMLU: questions are filtered to remove items that a competent layperson could answer without specialised, dual-use knowledge. This is intended to keep the benchmark measuring a quantity correlated with genuine uplift potential, rather than measuring general scientific literacy. The CAIS announcement describes the questions as covering "precursors, neighbors, and emulations of hazardous information" so that the public artifact is informative about a model's grasp of the surrounding technical context without itself revealing operational detail.[7][9]
A second filter screens out questions that the team judged too dangerous to publish; the authors report that "especially dangerous questions" are withheld and that development followed U.S. export-control guidance, including ITAR review.[7] A third filter is post-hoc and ongoing. The dataset's Hugging Face card records that on 8 March 2024 the team adjusted wmdp-cyber for choice-randomization issues, and that on 23 April 2024 it removed wmdp-cyber questions deemed excessively long and wmdp-bio questions with insufficient dual-use potential. These revisions took the public total from 4,157 down to 3,668.[3][4]
The current public split is approximately:[3][4][6]
| Subset | Approximate questions | Topic |
|---|---|---|
| wmdp-bio | ~1,273 | Viral engineering, pathogen biology, dual-use synthetic biology |
| wmdp-cyber | ~1,987 | Offensive cyber capabilities, exploitation, attack chains |
| wmdp-chem | ~408 | Hazardous chemical synthesis, deployment, and precursors |
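As a quick consistency check on the figures above, the approximate per-subset counts can be summed and compared with the size of the original release. The sketch below only restates numbers from the table and the preceding paragraph; the variable names are illustrative.

```python
# Approximate per-subset question counts, restated from the table above.
subset_counts = {"wmdp-bio": 1273, "wmdp-cyber": 1987, "wmdp-chem": 408}

original_total = 4157                        # size of the original release
current_total = sum(subset_counts.values())  # size of the current public split
removed = original_total - current_total     # items dropped by the revisions

# Share of the current benchmark contributed by each subset.
shares = {name: count / current_total for name, count in subset_counts.items()}

print(f"current total: {current_total}, removed: {removed}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The subsets sum to 3,668, which is 489 fewer than the original 4,157, and the cyber subset accounts for a little over half of the current items.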
Each item is a four-option multiple choice with fields question, choices (length-4 sequence of strings), and answer (an integer 0-3 indexing the correct option), as documented in the Hugging Face dataset card.[4] The README also carries an explicit warning that the benchmark data should never be allowed into training corpora, because that would render the benchmark uninformative.[4]
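The record layout described above can be illustrated with a small, self-contained sketch. The field names (`question`, `choices`, `answer`) come from the dataset card as summarized here; the `Item` class, the lettered prompt layout, and the `always_first` predictor are hypothetical stand-ins, and the demo items are benign placeholders rather than WMDP questions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

LETTERS = "ABCD"


@dataclass(frozen=True)
class Item:
    question: str
    choices: Sequence[str]  # exactly four answer options
    answer: int             # index 0-3 of the correct option


def format_prompt(item: Item) -> str:
    """Render an item as a lettered prompt (this layout is an assumption)."""
    lines = [item.question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[Item], predict: Callable[[Item], int]) -> float:
    """Fraction of items whose predicted index equals the `answer` field."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Benign placeholder items; these are NOT taken from WMDP.
demo_items = [
    Item("Which planet is closest to the Sun?", ["Venus", "Mercury", "Mars", "Earth"], 1),
    Item("How many bits are in one byte?", ["4", "16", "8", "2"], 2),
    Item("What is the chemical symbol for sodium?", ["Na", "So", "Sd", "N"], 0),
    Item("Which data structure is first-in, first-out?", ["Stack", "Tree", "Heap", "Queue"], 3),
]


def always_first(item: Item) -> int:
    """Hypothetical stand-in for a model: always picks option A."""
    return 0


print(format_prompt(demo_items[0]))
print(accuracy(demo_items, always_first))
```

A constant guesser scores 0.25 on these four balanced placeholders, which matches the chance level for four-option items. With the Hugging Face `datasets` library, a call along the lines of `load_dataset("cais/wmdp", "wmdp-bio")` should yield records with the same three fields, though the split name is worth checking against the dataset card.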
The same paper that introduces the WMDP benchmark also introduces an unlearning method. In the earlier version of the paper (and in early press coverage) the method is called CUT (Contrastive Unlearn Tuning, also rendered "Controlling Unlearned Representations" in some materials); in the final ICML version and the released code the method is simplified and renamed RMU, for Representation Misdirection for Unlearning.[1][5][6] The two share the same core idea, and the simplified RMU is what is implemented in centerforaisafety/wmdp and integrated into evaluation harnesses.[5][6]
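RMU is a fine-tuning method inspired by representation engineering. It is trained against a two-term objective: a forget term that perturbs the model's activations on hazardous text, and a retain term that preserves its activations on benign text.[1][5]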
Training interleaves gradient updates on the forget and retain corpora, and (in the released configuration) updates only a small window of layers near the chosen target layer, both for memory efficiency and to localize the intervention.[1][5]
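The two-term objective can be sketched in a few lines. The version below follows the RMU description in the paper as understood here: the forget term pulls activations on forget text toward a fixed, scaled random direction, and the retain term keeps activations on retain text close to those of a frozen copy of the model. Plain Python lists stand in for tensors, the symbols `u`, `c`, and `alpha` and their values are illustrative assumptions, and the gradient updates and layer selection are omitted.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def random_unit_vector(dim: int, rng: random.Random) -> Vector:
    """Random direction u, sampled once and then held fixed during training."""
    raw = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def mean_squared_distance(acts: List[Vector], targets: List[Vector]) -> float:
    """Squared L2 distance per token, averaged over tokens."""
    if not acts or len(acts) != len(targets):
        raise ValueError("acts and targets must be non-empty and equal length")
    per_token = [
        sum((a - t) ** 2 for a, t in zip(act, tgt))
        for act, tgt in zip(acts, targets)
    ]
    return sum(per_token) / len(per_token)


def rmu_loss(
    forget_acts: List[Vector],
    retain_acts: List[Vector],
    frozen_retain_acts: List[Vector],
    u: Vector,
    c: float,
    alpha: float,
) -> Tuple[float, float, float]:
    """Return (total, forget_term, retain_term) for one pair of batches."""
    control = [c * ui for ui in u]  # scaled random target for forget text
    forget_term = mean_squared_distance(forget_acts, [control] * len(forget_acts))
    retain_term = mean_squared_distance(retain_acts, frozen_retain_acts)
    return forget_term + alpha * retain_term, forget_term, retain_term


rng = random.Random(0)
u = random_unit_vector(dim=4, rng=rng)
c, alpha = 5.0, 100.0  # hypothetical hyperparameter values

# Toy activations at the target layer: two tokens, four dimensions each.
frozen_retain = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]
forget_before = [[0.3, 0.1, -0.4, 0.2], [0.2, -0.1, 0.0, 0.1]]

# Before any update: retain activations still equal the frozen model's.
loss_before = rmu_loss(forget_before, frozen_retain, frozen_retain, u, c, alpha)

# Idealized end state: forget activations sit exactly on the control vector.
ideal_forget = [[c * ui for ui in u] for _ in forget_before]
loss_ideal = rmu_loss(ideal_forget, frozen_retain, frozen_retain, u, c, alpha)

print(loss_before)
print(loss_ideal)
```

In this toy setting the loss starts positive only because of the forget term and reaches zero once forget-text activations match the control vector while retain-text activations are unchanged. In a real run the activations would come from the chosen layer of the model being updated and of a frozen copy, with an automatic-differentiation framework supplying the gradients.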
A practically important property highlighted in the paper is that RMU "does not assume access to the hazardous information that it intends to remove." Because the forget loss only needs a corpus of topically related text, not the answers to the WMDP questions themselves, developers can run RMU without holding sensitive material on their machines, which the CAIS newsletter describes as one of the method's main attractions for industrial use.[9]
The released cais/wmdp-corpora companion provides the text used for unlearning. The biosecurity forget corpus is a curated set of PubMed papers related to the topics that generate wmdp-bio items; the cyber forget corpus is built from a crawl of GitHub documents relevant to wmdp-cyber topics; and the retain corpus is general text such as Wikitext. The bio forget corpus must be requested by form, while the cyber forget corpus is available behind a password-gated archive.[13][14]
On Zephyr-7B-Beta, the paper reports that RMU reduces accuracy on wmdp-bio from 63.7% to 31.2% and on wmdp-cyber from 44.0% to 28.2%, while MMLU drops only from 58.1% to 57.1% and MT-Bench from 7.33 to 7.10.[1][5] On larger models the same recipe was applied to Yi-34B-Chat and Mixtral-8x7B-Instruct, with pre-trained unlearned checkpoints released alongside the code.[6] The CAIS blog summarises the result as CUT/RMU "reducing model performance on WMDP questions to random chance, while leaving accuracy nearly untouched on a standard battery of general knowledge tests."[7]
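One way to read these figures is to ask how much of the above-chance margin the unlearning step removes, since guessing on four-option items already yields 25%. The short calculation below uses only the Zephyr-7B-Beta numbers quoted above; the `margin_removed` helper is an illustrative construction, not a metric defined in the paper.

```python
CHANCE = 25.0  # expected accuracy (%) from random guessing on four options

# (before, after) accuracies in percent, as quoted above for Zephyr-7B-Beta.
reported = {
    "wmdp-bio": (63.7, 31.2),
    "wmdp-cyber": (44.0, 28.2),
}


def margin_removed(before: float, after: float, chance: float = CHANCE) -> float:
    """Fraction of the above-chance accuracy margin eliminated by unlearning."""
    if before <= chance:
        raise ValueError("baseline accuracy is not above chance")
    return (before - after) / (before - chance)


for subset, (before, after) in reported.items():
    frac = margin_removed(before, after)
    print(f"{subset}: {frac:.0%} of margin removed, "
          f"{after - CHANCE:.1f} points above chance remain")

# Cost on the general-capability checks quoted in the same sentence.
mmlu_drop = 58.1 - 57.1       # percentage points
mt_bench_drop = 7.33 - 7.10   # MT-Bench score units
print(f"MMLU drop: {mmlu_drop:.1f} points, MT-Bench drop: {mt_bench_drop:.2f}")
```

On these numbers roughly 84% of the above-chance margin is removed for wmdp-bio and roughly 83% for wmdp-cyber, leaving the unlearned model about 6 and 3 points above chance respectively, against a one-point drop on MMLU.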
The paper also reports that linear probes trained to recover the forgotten content from the unlearned model's internal states fail to do so, and that adversarial-suffix attacks using Greedy Coordinate Gradient (GCG) do not recover the knowledge even after thousands of optimization steps.[5] These are intended as evidence that RMU is doing more than refusal-style suppression, although later work has disputed that claim, as discussed among the limitations below.
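The probing methodology can be illustrated with a toy example. The sketch below trains a logistic-regression probe on synthetic two-dimensional "activations" and compares a case where the activations encode the label with a case where they carry no label information at all. Everything here is a hypothetical stand-in: the data, the training settings, and the helper names are assumptions for illustration, and the toy says nothing about whether real unlearned models behave like the second case.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(w: Vector, b: float, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(acts: List[Vector], labels: List[int],
                steps: int = 200, lr: float = 0.5) -> Tuple[Vector, float]:
    """Fit a linear probe with full-batch gradient descent on log loss."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(logit(w, b, x)) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_accuracy(w: Vector, b: float,
                   acts: List[Vector], labels: List[int]) -> float:
    hits = sum(1 for x, y in zip(acts, labels) if (logit(w, b, x) > 0) == (y == 1))
    return hits / len(acts)


# Case 1: activations encode the label along the first dimension.
train_x = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0], [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [[0.9, 0.1], [1.1, -0.2], [-0.8, 0.0], [-1.2, 0.2]]
test_y = [1, 1, 0, 0]
w, b = train_probe(train_x, train_y)
informative_acc = probe_accuracy(w, b, test_x, test_y)

# Case 2: activations are identical for every input, so no label signal exists.
flat_train_x = [[0.5, 0.5] for _ in train_y]
flat_test_x = [[0.5, 0.5] for _ in test_y]
w_flat, b_flat = train_probe(flat_train_x, train_y)
uninformative_acc = probe_accuracy(w_flat, b_flat, flat_test_x, test_y)

print(informative_acc, uninformative_acc)
```

The probe separates the held-out points perfectly in the first case and cannot beat a coin flip on the balanced held-out set in the second, which is the kind of contrast a probing experiment looks for. A failed probe shows only that this particular readout found nothing, which is one reason the later critiques focus on other recovery routes such as fine-tuning.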
WMDP has been integrated into several standard evaluation stacks since its release.
- EleutherAI's lm-evaluation-harness exposes wmdp_bio, wmdp_cyber, wmdp_chemistry, and a grouped wmdp task; the harness's README describes the benchmark in the same terms as the paper and documents the per-subset question counts.[15]
- Inspect Evals provides inspect_evals/wmdp, with subtasks for biosecurity, cybersecurity, and chemical security. The framework scores models by accuracy across the multiple-choice items.[16]
- Public leaderboards such as llm-stats.com/benchmarks/wmdp aggregate self-reported and harness-evaluated scores from frontier models; as of mid-2026 the highest-ranked entry is Grok-4.1 Thinking at 0.840 on the combined wmdp task.[17]

WMDP fills a narrow but consequential gap in the public benchmark ecosystem. Before its release, public multiple-choice tests of scientific competence (such as MMLU and MMLU-Pro) lumped together benign and dual-use knowledge, and public safety benchmarks (such as HarmBench and JailbreakBench) focused on whether a model would output harmful content given an adversarial prompt rather than whether it had the underlying knowledge. WMDP is built specifically to factor out compliance behavior: it measures whether the knowledge is in the model at all, regardless of whether the model is currently willing to articulate it.[1][7][9]
A second contribution is the explicit framing of WMDP as both an evaluation and a target for an intervention. The paper's stated thesis is that unlearning, rather than refusal training or post-hoc filters, may be a more durable path toward reducing malicious-use risk; WMDP provides a metric against which different unlearning methods can be compared on equal footing.[1][5] This framing has been picked up directly by policy commentary: the CAIS AI Safety Newsletter and the Time coverage both describe the benchmark and method together as a "concrete path toward reducing malicious use from LLMs."[9][11]
Adoption by the UK AI Safety Institute (via Inspect Evals) and by EleutherAI's lm-evaluation-harness has made WMDP one of a small number of public evaluations that are routinely reported in third-party assessments of frontier models, alongside HarmBench and similar benchmarks.[15][16]
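When per-subset and combined scores are reported side by side, it helps to know how the combined figure is formed and how much sampling noise each subset carries. The sketch below contrasts a size-weighted (micro) average with an unweighted (macro) average and adds a binomial standard error per subset. The per-subset accuracies are hypothetical, and whether a given harness or leaderboard weights by subset size is an assumption to check against its documentation.

```python
import math
from typing import Dict

# Approximate subset sizes from the current public split.
subset_sizes: Dict[str, int] = {"bio": 1273, "cyber": 1987, "chem": 408}

# Hypothetical per-subset accuracies, for illustration only.
subset_acc: Dict[str, float] = {"bio": 0.80, "cyber": 0.60, "chem": 0.70}


def micro_average(acc: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Size-weighted mean: equivalent to pooling all questions together."""
    total = sum(sizes[name] for name in acc)
    return sum(acc[name] * sizes[name] for name in acc) / total


def macro_average(acc: Dict[str, float]) -> float:
    """Unweighted mean: every subset counts equally regardless of size."""
    return sum(acc.values()) / len(acc)


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


micro = micro_average(subset_acc, subset_sizes)
macro = macro_average(subset_acc)
errors = {name: standard_error(subset_acc[name], subset_sizes[name])
          for name in subset_acc}

print(f"micro: {micro:.3f}, macro: {macro:.3f}")
for name, se in errors.items():
    print(f"{name}: accuracy {subset_acc[name]:.2f} +/- {se:.3f}")
```

Because the cyber subset is the largest, the size-weighted figure is pulled toward the cyber score, and the small chemistry subset has the widest error bar, so differences of a point or two on wmdp-chem are hard to distinguish from noise.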
The authors and subsequent commentators have identified several limitations.
Proxy fidelity. WMDP is a proxy benchmark by construction: the public release excludes the most operationally sensitive questions, and questions focus on adjacent and precursor knowledge rather than the exact information that would constitute uplift. The paper argues that performance on WMDP correlates with performance on a held-out set of more sensitive questions, but the proxy gap remains a known source of measurement noise.[1][5]
Vulnerability to relearning. A recurring point in both the original paper and later analyses is that RMU's effects can be reversed. Once a model with unlearned weights is released, a downstream party with fine-tuning access can in many cases restore the original capability by fine-tuning on generic data; the CAIS newsletter notes explicitly that this limitation means CUT "does not mitigate risks from open-source models" and is most useful as a step applied to closed-source systems before deployment.[9] Subsequent analyses on the AI Alignment Forum and in follow-up papers (e.g., "Unlearning via RMU is mostly shallow," and later work on representation-misdirection robustness) argue that RMU's intervention is concentrated in shallow features and can be largely recovered by simple rephrasing of the prompt or by light fine-tuning, suggesting the method functions more like a filter than true knowledge erasure.[18]
Sandbagging. Because WMDP is a public, automatically scored multiple-choice test, models can in principle be trained or prompted to underperform specifically on WMDP while retaining hazardous capability elsewhere, an evasion known as "sandbagging" or "password-locking." This vulnerability has been documented in later evaluation-methodology work and is an active area of research in the regulator-oriented evaluation literature.[19]
Benchmark saturation. As frontier models have improved, WMDP-Bio scores in particular have climbed well above the expert baselines reported in the paper, prompting discussion of whether the benchmark is approaching saturation and whether new, harder, expert-baselined items are needed.[19]
Question quality. The April 2024 cleanup that removed roughly 500 items from the original release illustrates that not every question was clean dual-use signal; some wmdp-bio items were judged to lack adequate dual-use potential and were removed, and some wmdp-cyber items were removed for excessive length.[4]
WMDP sits alongside several other public benchmarks and methods that touch on dangerous-capability evaluation and on knowledge editing or removal in large language models:
| Benchmark / method | Focus | Relation to WMDP |
|---|---|---|
| MMLU | General multitask knowledge | Used by WMDP as the retain-side check that unlearning has not damaged general capability.[1] |
| HarmBench | Behavioral red-teaming, harmful-output elicitation | Measures refusal/jailbreak behavior; WMDP measures underlying knowledge irrespective of compliance.[7] |
| JailbreakBench | Standardized jailbreak attacks | Complementary; addresses behavioral elicitation rather than knowledge presence. |
| MACHIAVELLI | Power-seeking and ethical agency in text games | Earlier CAIS-led safety benchmark; methodologically distinct but shares senior authorship. |
| Representation engineering | Intervention on internal activations | The conceptual basis from which RMU's forget loss is derived.[1][5] |