# WMDP benchmark

> Source: https://aiwiki.ai/wiki/wmdp
> Updated: 2026-07-16
> Categories: AI Benchmarks, AI Safety
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **Weapons of Mass Destruction Proxy (WMDP)** is a publicly released benchmark and unlearning testbed for large language models, introduced in the paper "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" by Nathaniel Li, Alexander Pan, Anjali Gopal, and over fifty co-authors, posted to arXiv on 5 March 2024 (arXiv:2403.03218) and published at the International Conference on Machine Learning (ICML) 2024.[^1][^2] The dataset, hosted at `cais/wmdp` on Hugging Face and at `wmdp.ai`, is a collection of expert-authored, multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security; the original release contained 4,157 questions, and the project's actively maintained, dual-use-filtered version currently distributes 3,668 items split across `wmdp-bio`, `wmdp-cyber`, and `wmdp-chem` configurations.[^1][^3][^4]

In the same paper the authors introduced **CUT** (Contrastive Unlearn Tuning), later simplified and renamed **RMU** (Representation Misdirection for Unlearning), a fine-tuning method that perturbs model activations on hazardous text while preserving activations on benign text, reducing performance on WMDP roughly to chance while largely retaining capability on the [MMLU](/wiki/mmlu) and [MT-Bench](/wiki/mt_bench) evaluations.[^1][^5][^6] WMDP was developed by the [Center for AI Safety](/wiki/center_for_ai_safety) (CAIS) in collaboration with [Scale AI](/wiki/scale_ai) and a consortium of more than twenty academic and industry partners, with the explicit policy ambition of giving regulators, frontier labs, and the research community a shared, open instrument for measuring catastrophic-misuse risk.[^7][^8][^9]

## Background

By late 2023 and early 2024 the question of whether [large language models](/wiki/large_language_model) could meaningfully accelerate the development of biological, chemical, or cyber weapons had moved from speculative discussion into formal policy. The White House [Executive Order on AI](/wiki/ai_executive_order) (Executive Order 14110, signed 30 October 2023) explicitly directed agencies to develop evaluations and red-teaming protocols for chemical, biological, radiological, and nuclear (CBRN) risks of dual-use foundation models, and major labs began running private "capability evaluations" for those domains.[^7][^9][^10] The WMDP authors framed their work as a response to a specific problem with that landscape: hazardous-capability evaluations existed but were almost exclusively private, held inside government bodies and frontier labs, which prevented the broader research community from independently studying mitigations or comparing methods.[^1][^7]

WMDP was first announced on 5 March 2024 in a joint release from CAIS and Scale AI, with a coordinated launch on the project page `wmdp.ai`, the Hugging Face dataset `cais/wmdp`, the GitHub repository `centerforaisafety/wmdp`, and the arXiv preprint.[^1][^7][^8] The launch was covered the next day in Time magazine, which reported that Scale AI and CAIS had built a public benchmark of "more than 4,000 multiple-choice questions" that probe whether a model has internalised hazardous knowledge, together with an unlearning method intended to scrub that knowledge from a model rather than merely refusing to answer.[^11] The Time article quoted [Dan Hendrycks](/wiki/dan_hendrycks), executive director of CAIS, expressing the hope that the benchmark would be adopted as one of the primary public references for open-source developers reporting safety properties of their releases.[^11][^12]

The name "Weapons of Mass Destruction Proxy" is deliberate: the dataset is a *proxy* for catastrophic-misuse capability rather than a direct test of it. The authors describe the benchmark as built around "precursors, neighbors, and emulations of hazardous information" so that the public release does not itself constitute an uplift hazard, and they note that the development process was guided by counsel on U.S. export-control regimes including the International Traffic in Arms Regulations (ITAR).[^7][^9]

## Infobox

| Attribute | Value |
|---|---|
| Full name | Weapons of Mass Destruction Proxy |
| First release | 5 March 2024 |
| arXiv ID | 2403.03218 |
| Venue | [ICML](/wiki/icml) 2024 |
| Lead authors | Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti (and ~50 others) |
| Senior authors | Mantas Mazeika, Andy Zou, [Dan Hendrycks](/wiki/dan_hendrycks) |
| Organizations | [Center for AI Safety](/wiki/center_for_ai_safety), [Scale AI](/wiki/scale_ai), SecureBio, academic and industry consortium |
| Initial question count | 4,157 |
| Current question count | 3,668 (after March and April 2024 data-quality updates) |
| Domains | Biosecurity, cybersecurity, chemical security |
| Format | Four-option multiple choice |
| Hugging Face slug | `cais/wmdp` |
| License | MIT |
| Project site | `wmdp.ai` |
| Companion method | CUT / RMU (Representation Misdirection for Unlearning) |

## Question generation methodology

WMDP's questions were written by domain experts rather than scraped or generated by language models. The authors recruited a consortium of academics and technical consultants in biosecurity, cybersecurity, and chemistry, and required that every candidate item be reviewed by at least two specialists from different organizations before inclusion.[^7][^9] The CAIS blog states that authors and reviewers were drawn from "over twenty academic institutions, technical consultants, and industry partners."[^7]

A key design constraint distinguishes WMDP from prior multiple-choice tests of scientific knowledge such as MMLU: questions are filtered to remove items that a competent layperson could answer without specialised, dual-use knowledge. This is intended to keep the benchmark measuring a quantity correlated with genuine uplift potential, rather than measuring general scientific literacy. The CAIS announcement describes the questions as covering "precursors, neighbors, and emulations of hazardous information" so that the public artifact is informative about a model's grasp of the surrounding technical context without itself revealing operational detail.[^7][^9]

A second filter screens out questions that the team judged too dangerous to publish; the authors report that "especially dangerous questions" are withheld and that development followed U.S. export-control guidance, including ITAR review.[^7] A third filter is post-hoc and ongoing: the dataset's Hugging Face card records that on 8 March 2024 the team adjusted `wmdp-cyber` for choice-randomization issues, and that on 23 April 2024 it removed `wmdp-cyber` questions deemed excessively long and `wmdp-bio` questions with insufficient dual-use potential. Together these updates took the public total from 4,157 down to 3,668.[^3][^4]

The current public split is approximately:[^3][^4][^6]

| Subset | Approximate questions | Topic |
|---|---|---|
| `wmdp-bio` | ~1,273 | Viral engineering, pathogen biology, dual-use synthetic biology |
| `wmdp-cyber` | ~1,987 | Offensive cyber capabilities, exploitation, attack chains |
| `wmdp-chem` | ~408 | Hazardous chemical synthesis, deployment, and precursors |
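
As a quick sanity check on the figures above, the approximate per-subset counts can be added up and compared with the stated totals. The sketch below uses only numbers quoted in this article; the variable names are illustrative.

```python
# Approximate per-subset counts as quoted in the table above.
subset_counts = {"wmdp-bio": 1273, "wmdp-cyber": 1987, "wmdp-chem": 408}

original_total = 4157  # size of the initial March 2024 release
current_total = sum(subset_counts.values())
removed = original_total - current_total

print(f"current total: {current_total}")
print(f"removed since first release: {removed}")
for name, n in subset_counts.items():
    print(f"{name}: {n / current_total:.1%} of current items")
```

The three subsets sum to 3,668, and the difference from the original 4,157 is 489 items.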

Each item is a four-option multiple choice with fields `question`, `choices` (length-4 sequence of strings), and `answer` (an integer 0-3 indexing the correct option), as documented in the Hugging Face dataset card.[^4] The README also carries an explicit warning that the benchmark data should never be allowed into training corpora, because that would render the benchmark uninformative.[^4]
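
A minimal sketch of how a record with this schema might be rendered as a prompt and scored. The item is a benign placeholder written for this illustration, not a benchmark question, and the helper names `format_prompt` and `is_correct`, along with the prompt layout, are assumptions rather than part of any released harness.

```python
from typing import TypedDict

LETTERS = "ABCD"

class Item(TypedDict):
    question: str
    choices: list[str]  # exactly four options
    answer: int         # index 0-3 of the correct option

def format_prompt(item: Item) -> str:
    """Render an item as a lettered multiple-choice prompt (layout is an assumption)."""
    if len(item["choices"]) != 4:
        raise ValueError("expected exactly four choices")
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def is_correct(item: Item, predicted_letter: str) -> bool:
    """Compare a predicted letter (A-D) with the integer answer index."""
    letter = predicted_letter.strip().upper()
    if len(letter) != 1 or letter not in LETTERS:
        return False  # unparseable predictions count as wrong
    return LETTERS.index(letter) == item["answer"]

# Benign placeholder item, NOT taken from the benchmark.
example: Item = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer": 1,
}

print(format_prompt(example))
print(is_correct(example, "B"))
```

Because `answer` is a zero-based index, any code that maps model output letters to options has to apply the same ordering as `choices`; the explicit `LETTERS` string keeps that mapping in one place.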

## How CUT and RMU work

The same paper that introduces the WMDP benchmark also introduces an unlearning method. In the earlier version of the paper (and in early press coverage) the method is called **CUT** (Contrastive Unlearn Tuning, also rendered "Controlling Unlearned Representations" in some materials); in the final ICML version and the released code the method is simplified and renamed **RMU**, for **Representation Misdirection for Unlearning**.[^1][^5][^6] The two share the same core idea, and the simplified RMU is what is implemented in `centerforaisafety/wmdp` and integrated into evaluation harnesses.[^5][^6]

### Forget loss and retain loss

RMU is a fine-tuning method inspired by [representation engineering](/wiki/representation_engineering). It is trained against a two-term objective:[^1][^5]

- A **forget loss** that pushes the residual-stream activations produced by the model on hazardous text toward a fixed random direction, while inflating their norm. Concretely, the loss takes the squared L2 distance between the updated model's activation at each token of a forget-corpus example and a target vector formed by scaling a random direction, so that whatever semantic structure the original representation encoded on that token is replaced by an unstructured, off-distribution signal.
- A **retain loss** that takes the squared L2 distance between the updated model's activations on benign text and the original (frozen) model's activations on the same text, holding the model's behavior on retain-data approximately constant.

Training interleaves gradient updates on the forget and retain corpora, and (in the released configuration) updates only a small window of layers near the chosen target layer, both for memory efficiency and to localize the intervention.[^1][^5]
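
The two-term objective described above can be sketched numerically. This is a toy illustration in plain Python, not the released implementation: activations are hard-coded lists rather than outputs of a model layer, and the function names, the weighting parameter `alpha`, the scale of 6.0, and the way the random direction is drawn are all assumptions made for the example.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_token_distance(acts, targets):
    """Average squared L2 distance over token positions."""
    return sum(squared_l2(a, t) for a, t in zip(acts, targets)) / len(acts)

def make_control_vector(dim, scale, seed=0):
    """Fixed random direction rescaled to norm `scale` (illustrative choice)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(dim)]
    norm = sum(x * x for x in u) ** 0.5
    return [scale * x / norm for x in u]

def rmu_style_loss(updated_forget, updated_retain, frozen_retain, control, alpha=1.0):
    """Forget term plus weighted retain term; `alpha` is a hypothetical weight."""
    forget = mean_token_distance(updated_forget, [control] * len(updated_forget))
    retain = mean_token_distance(updated_retain, frozen_retain)
    return forget + alpha * retain, forget, retain

# Toy activations: 3 token positions, hidden size 4 (made-up numbers).
control = make_control_vector(dim=4, scale=6.0)
updated_forget = [[0.1, -0.2, 0.3, 0.0]] * 3
frozen_retain = [[0.5, 0.5, -0.5, 0.1]] * 3
updated_retain = [[0.5, 0.4, -0.5, 0.1]] * 3

total, forget, retain = rmu_style_loss(updated_forget, updated_retain, frozen_retain, control)
print(f"forget={forget:.3f} retain={retain:.3f} total={total:.3f}")
```

In this sketch the forget term reaches zero only when every forget-token activation equals the scaled random target, and the retain term reaches zero only when the updated model reproduces the frozen model's activations on benign text. In an actual training loop the activations would be read from a chosen layer, and gradients would flow only into the updated model's trainable layers.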

A practically important property highlighted in the paper is that RMU "does not assume access to the hazardous information that it intends to remove." Because the forget loss only needs a corpus of *topically related* text, not the answers to the WMDP questions themselves, developers can run RMU without holding sensitive material on their machines, which the CAIS newsletter describes as one of the method's main attractions for industrial use.[^9]

### Forget and retain corpora

The released `cais/wmdp-corpora` companion provides the text used for unlearning. The biosecurity forget corpus is a curated set of PubMed papers related to the topics that generate `wmdp-bio` items; the cyber forget corpus is built from a crawl of GitHub documents relevant to `wmdp-cyber` topics; and the retain corpus is general text such as Wikitext. The bio forget corpus must be requested by form, while the cyber forget corpus is available behind a password-gated archive.[^13][^14]

### Reported results

On Zephyr-7B-Beta, the paper reports that RMU reduces accuracy on `wmdp-bio` from 63.7% to 31.2% and on `wmdp-cyber` from 44.0% to 28.2%, while MMLU drops only from 58.1% to 57.1% and MT-Bench from 7.33 to 7.10.[^1][^5] On larger models the same recipe was applied to Yi-34B-Chat and Mixtral-8x7B-Instruct, with pre-trained unlearned checkpoints released alongside the code.[^6] The CAIS blog summarises the result as CUT/RMU "reducing model performance on WMDP questions to random chance, while leaving accuracy nearly untouched on a standard battery of general knowledge tests."[^7]
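
To put these figures in context, the short calculation below compares the reported post-unlearning accuracies with the 25% expected from guessing on a four-option test. It uses only the numbers quoted above; the variable names are illustrative.

```python
CHANCE = 100 / 4  # four-option multiple choice: 25% expected from guessing

# (before, after) accuracies in percent for Zephyr-7B-Beta, as quoted above.
reported = {
    "wmdp-bio": (63.7, 31.2),
    "wmdp-cyber": (44.0, 28.2),
    "MMLU": (58.1, 57.1),
}

for name, (before, after) in reported.items():
    drop = before - after
    print(f"{name}: drop {drop:.1f} points, {after - CHANCE:+.1f} vs chance")

# MT-Bench is a 1-10 judge score rather than an accuracy, so it is handled separately.
mt_bench_before, mt_bench_after = 7.33, 7.10
print(f"MT-Bench: drop {mt_bench_before - mt_bench_after:.2f}")
```

On these numbers `wmdp-bio` falls by 32.5 points and `wmdp-cyber` by 15.8, ending 6.2 and 3.2 points above chance respectively, while MMLU drops by 1.0 point and MT-Bench by 0.23.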

The paper also reports that linear probes trained to recover the forgotten content from the unlearned model's internal states fail to do so, and that adversarial-suffix attacks using GCG do not recover the knowledge even after thousands of optimization steps.[^5] These are intended as evidence that RMU is doing more than refusal-style suppression, although later work has disputed that claim (see Limitations).

## Implementations and adoption

WMDP has been integrated into several standard evaluation stacks since its release.

- **EleutherAI lm-evaluation-harness.** The Hugging Face-hosted dataset is wrapped by tasks `wmdp_bio`, `wmdp_cyber`, `wmdp_chem`, and a grouped `wmdp` task in [EleutherAI](/wiki/eleutherai)'s `lm-evaluation-harness`; the harness's README describes the benchmark in the same terms as the paper and documents the per-subset question counts.[^15]
- **UK AISI Inspect Evals.** The UK [AI Safety Institute](/wiki/ai_safety_institute) (now the AI Security Institute) ships WMDP as a built-in evaluation in its open-source Inspect framework, under `inspect_evals/wmdp`, with subtasks for biosecurity, cybersecurity, and chemical security. The framework scores models by accuracy across the multiple-choice items.[^16]
- **Reference unlearned checkpoints.** The CAIS GitHub repository ships pre-unlearned versions of Zephyr-7B, Yi-34B-Chat, and Mixtral-8x7B-Instruct via Hugging Face, allowing third parties to evaluate the quality of the released RMU artifacts directly.[^6]
- **Public leaderboards.** Third-party leaderboards such as `llm-stats.com/benchmarks/wmdp` aggregate self-reported and harness-evaluated scores from frontier models; as of mid-2026 the highest-ranked entry is Grok-4.1 Thinking at 0.840 on the combined `wmdp` task.[^17]
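
Accuracy-based scoring of the kind these harnesses perform can be sketched as follows. The predictions are made up, the function name `accuracy_by_subset` is hypothetical, and how a given harness or leaderboard aggregates the combined `wmdp` score is not stated here, so both a question-weighted and a subset-weighted average are shown.

```python
from collections import defaultdict

def accuracy_by_subset(records):
    """records: iterable of (subset, predicted_index, answer_index) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, answer in records:
        total[subset] += 1
        correct[subset] += int(predicted == answer)
    return {s: correct[s] / total[s] for s in total}, correct, total

# Made-up predictions for illustration only.
records = [
    ("wmdp-bio", 1, 1), ("wmdp-bio", 0, 2),
    ("wmdp-cyber", 3, 3), ("wmdp-cyber", 2, 2), ("wmdp-cyber", 0, 1), ("wmdp-cyber", 1, 1),
    ("wmdp-chem", 2, 0),
]

per_subset, correct, total = accuracy_by_subset(records)
micro = sum(correct.values()) / sum(total.values())  # every question weighted equally
macro = sum(per_subset.values()) / len(per_subset)   # every subset weighted equally

for subset, acc in per_subset.items():
    print(f"{subset}: {acc:.2f}")
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Because the subsets differ widely in size, the two averages can diverge noticeably, which is worth checking before comparing a combined score across sources.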

## Significance

WMDP fills a narrow but consequential gap in the public benchmark ecosystem. Before its release, public multiple-choice tests of scientific competence (such as MMLU and [MMLU-Pro](/wiki/mmlu-pro)) lumped together benign and dual-use knowledge, and public safety benchmarks (such as [HarmBench](/wiki/harmbench) and [JailbreakBench](/wiki/jailbreakbench)) focused on whether a model would *output* harmful content given an adversarial prompt rather than whether it had the underlying knowledge. WMDP is built specifically to factor out compliance behavior: it measures whether the knowledge is *in the model* at all, regardless of whether the model is currently willing to articulate it.[^1][^7][^9]

A second contribution is the explicit framing of WMDP as both an evaluation and a target for an intervention. The paper's stated thesis is that unlearning, rather than refusal training or post-hoc filters, may be a more durable path toward reducing malicious-use risk; WMDP provides a metric against which different unlearning methods can be compared on equal footing.[^1][^5] This framing has been picked up directly by policy commentary: the CAIS AI Safety Newsletter and the Time coverage both describe the benchmark and method together as a "concrete path toward reducing malicious use from LLMs."[^9][^11]

Adoption by the UK AI Security Institute (via Inspect Evals) and inclusion in EleutherAI's lm-evaluation-harness have made WMDP one of a small number of public evaluations that are routinely reported in third-party assessments of frontier models, alongside HarmBench and similar benchmarks.[^15][^16]

## Limitations and criticisms

The authors and subsequent commentators have identified several limitations.

**Proxy fidelity.** WMDP is a *proxy* benchmark by construction: the public release excludes the most operationally sensitive questions, and questions focus on adjacent and precursor knowledge rather than the exact information that would constitute uplift. The paper argues that performance on WMDP correlates with performance on a held-out set of more sensitive questions, but the proxy gap remains a known source of measurement noise.[^1][^5]

**Vulnerability to relearning.** A recurring point in both the original paper and later analyses is that RMU's effects can be reversed. Once a model with unlearned weights is released, a downstream party with fine-tuning access can in many cases restore the original capability by fine-tuning on generic data; the CAIS newsletter notes explicitly that this limitation means CUT "does not mitigate risks from open-source models" and is most useful as a step applied to closed-source systems before deployment.[^9] Subsequent analyses on the AI Alignment Forum and in follow-up papers (e.g., "Unlearning via RMU is mostly shallow," and later work on representation-misdirection robustness) argue that RMU's intervention is concentrated in shallow features and can be largely recovered by simple rephrasing of the prompt or by light fine-tuning, suggesting the method functions more like a filter than true knowledge erasure.[^18]

**Sandbagging.** Because WMDP is a public, automatically scored multiple-choice test, models can in principle be trained or prompted to underperform specifically on WMDP while retaining hazardous capability elsewhere, an evasion known as "sandbagging"; "password-locking," in which a model is trained to reveal a capability only when a specific trigger string appears in the prompt, is one way to produce it. This vulnerability has been documented in later evaluation-methodology work and is an active area of research in the regulator-oriented evaluation literature.[^19]

**Benchmark saturation.** As frontier models have improved, WMDP-Bio scores in particular have climbed well above the expert baselines reported in the paper, prompting discussion of whether the benchmark is approaching saturation and whether new, harder, expert-baselined items are needed.[^19]

**Question quality.** The April 2024 cleanup, which accounts for most of the roughly 500 items removed from the original release, shows that not every question carried a clean dual-use signal: some `wmdp-bio` items were judged to lack adequate dual-use potential and were removed, and some `wmdp-cyber` items were removed for excessive length.[^4]

## Related work and comparison

WMDP sits alongside several other public benchmarks and methods that touch on dangerous-capability evaluation and on knowledge editing or removal in large language models:

| Benchmark / method | Focus | Relation to WMDP |
|---|---|---|
| [MMLU](/wiki/mmlu) | General scientific knowledge | Used by WMDP as the retain-side check that unlearning has not damaged general capability.[^1] |
| [HarmBench](/wiki/harmbench) | Behavioral red-teaming, harmful-output elicitation | Measures refusal/jailbreak behavior; WMDP measures underlying knowledge irrespective of compliance.[^7] |
| [JailbreakBench](/wiki/jailbreakbench) | Standardized jailbreak attacks | Complementary; addresses behavioral elicitation rather than knowledge presence. |
| [MACHIAVELLI](/wiki/machiavelli_benchmark) | Power-seeking and ethical agency in text games | Earlier CAIS-led safety benchmark; methodologically distinct but shares senior authorship. |
| [Representation engineering](/wiki/representation_engineering) | Intervention on internal activations | The conceptual basis from which RMU's forget loss is derived.[^1][^5] |

## See also

- [MMLU](/wiki/mmlu)
- [MMLU-Pro](/wiki/mmlu-pro)
- [HarmBench](/wiki/harmbench)
- [JailbreakBench](/wiki/jailbreakbench)
- [MACHIAVELLI benchmark](/wiki/machiavelli_benchmark)
- [Representation engineering](/wiki/representation_engineering)
- [Center for AI Safety](/wiki/center_for_ai_safety)
- [Scale AI](/wiki/scale_ai)
- [Dan Hendrycks](/wiki/dan_hendrycks)
- [AI Safety Institutes](/wiki/ai_safety_institute)
- [US AI Safety Institute](/wiki/us_aisi)
- [Executive Order on AI](/wiki/ai_executive_order)
- [Red teaming (AI)](/wiki/red_teaming)
- [EleutherAI](/wiki/eleutherai)
- [ICML](/wiki/icml)
- [MT-Bench](/wiki/mt_bench)
- [Llama 2](/wiki/llama_2)
- [Mixtral](/wiki/mixtral)
- [Reinforcement Learning from Human Feedback (RLHF)](/wiki/rlhf)
- [AI safety](/wiki/ai_safety)
- [Frontier model](/wiki/frontier_model)

## References

[^1]: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", arXiv preprint, 2024-03-05. https://arxiv.org/abs/2403.03218. Accessed 2026-05-21.
[^2]: Nathaniel Li et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning", Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. https://proceedings.mlr.press/v235/li24bc.html. Accessed 2026-05-21.
[^3]: WMDP Project, "WMDP Benchmark", wmdp.ai project page, 2024-03-05. https://www.wmdp.ai/. Accessed 2026-05-21.
[^4]: Center for AI Safety, "cais/wmdp dataset card", Hugging Face Datasets, 2024-03-05 (with data-quality updates 2024-03-08 and 2024-04-23). https://huggingface.co/datasets/cais/wmdp. Accessed 2026-05-21.
[^5]: Nathaniel Li et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", arXiv:2403.03218 HTML rendering of v7, 2024-05-15. https://arxiv.org/html/2403.03218v7. Accessed 2026-05-21.
[^6]: Center for AI Safety, "centerforaisafety/wmdp", GitHub repository, 2024. https://github.com/centerforaisafety/wmdp. Accessed 2026-05-21.
[^7]: Center for AI Safety, "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", CAIS blog, 2024-03-05. https://safe.ai/blog/wmdp-benchmark. Accessed 2026-05-21.
[^8]: Scale AI, "New Safety Benchmark for Large Language Models", Scale AI blog, 2024-03-05. https://scale.com/blog/measuring-mitigating-risk-wmdp. Accessed 2026-05-21.
[^9]: Center for AI Safety, "AI Safety Newsletter #32: Measuring and Reducing Hazardous Knowledge in LLMs", safe.ai newsletter, 2024-03. https://newsletter.safe.ai/p/ai-safety-newsletter-32-measuring. Accessed 2026-05-21.
[^10]: The White House, "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence" (EO 14110), 2023-10-30. https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/. Accessed 2026-05-21.
[^11]: Time Magazine, "Researchers Unveil Method to Purge AI of Dangerous Knowledge", time.com article on WMDP and CUT, 2024-03. https://time.com/6878893/ai-artificial-intelligence-dangerous-knowledge/. Accessed 2026-05-21.
[^12]: Sejal Sharma, "This 'weapon' can wipe the AI slate clean", Interesting Engineering, 2024-03-09. https://interestingengineering.com/innovation/this-weapon-can-wipe-the-ai-slate-clean. Accessed 2026-05-21.
[^13]: Center for AI Safety, "cais/wmdp-corpora dataset card", Hugging Face Datasets, 2024. https://huggingface.co/datasets/cais/wmdp-corpora. Accessed 2026-05-21.
[^14]: Center for AI Safety, "cais/wmdp-bio-forget-corpus", Hugging Face Datasets, 2024. https://huggingface.co/datasets/cais/wmdp-bio-forget-corpus. Accessed 2026-05-21.
[^15]: EleutherAI, "lm-evaluation-harness: WMDP task README", GitHub, 2024. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wmdp/README.md. Accessed 2026-05-21.
[^16]: UK AI Safety Institute, "WMDP: Measuring and Reducing Malicious Use With Unlearning", Inspect Evals documentation, 2024. https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/wmdp/. Accessed 2026-05-21.
[^17]: LLM-Stats, "WMDP Benchmark Leaderboard", llm-stats.com, 2026. https://llm-stats.com/benchmarks/wmdp. Accessed 2026-05-21.
[^18]: Eugene van der Watt, "WMDP measures and reduces LLM malicious use with unlearning", DailyAI, 2024-03-12. https://dailyai.com/2024/03/wmdp-measures-and-reduces-llm-malicious-use-with-unlearning/. Accessed 2026-05-21.
[^19]: Emergent Mind, "WMDP Benchmark for LLM Hazardous Knowledge", emergentmind.com topic page, 2024-2026. https://www.emergentmind.com/topics/wmdp-benchmark. Accessed 2026-05-21.