The Center for AI Safety (CAIS) is an American nonprofit research and advocacy organization based in San Francisco, California, dedicated to reducing societal-scale risks from artificial intelligence. Founded in 2022 by Dan Hendrycks and Oliver Zhang, CAIS pursues its mission through three pillars: technical research on AI safety, field-building to grow the community of safety researchers, and policy advocacy to promote responsible AI governance. The organization gained widespread public attention in May 2023 when it published a one-sentence "Statement on AI Risk" that was signed by hundreds of leading AI researchers, executives, and public figures, comparing the risk of extinction from AI to the threats posed by pandemics and nuclear war.
CAIS was founded in 2022 by Dan Hendrycks and Oliver Zhang shortly after Hendrycks completed his Ph.D. in Computer Science at the University of California, Berkeley, where he was advised by Dawn Song and Jacob Steinhardt. Before establishing CAIS, Hendrycks had already made significant contributions to machine learning, including the introduction of the GELU activation function in 2016 and the MMLU (Massive Multitask Language Understanding) benchmark in 2020, both of which became widely adopted across the AI research community.
The founding of CAIS reflected a growing concern among AI researchers that the rapid progress in AI capabilities was outpacing efforts to ensure that these systems would remain safe and aligned with human values. Hendrycks and Zhang established the organization to fill a gap in the AI safety ecosystem by combining original research with practical field-building and public advocacy.
The organization is headquartered in San Francisco. In July 2024, CAIS formally expanded its presence to Washington, D.C., launching both CAIS and the CAIS Action Fund in the capital with a public event featuring keynote addresses by Senator Brian Schatz (D-HI) and Representative French Hill (R-AR), attended by over 100 stakeholders and policymakers.
The stated mission of CAIS is "to reduce societal-scale risks from artificial intelligence." To carry out this mission, the organization operates through four functional teams:
| Team | Role |
|---|---|
| Research Team | Conducts conceptual and empirical AI safety research |
| Cloud and DevOps Team | Manages the compute cluster supporting approximately 20 research labs |
| Projects Team | Oversees field-building initiatives, collaborations, and advisory work |
| Operations Team | Provides organizational support including tools, processes, and personnel |
Dan Hendrycks serves as the Executive and Research Director of CAIS. He earned his B.S. from the University of Chicago in 2018 and his Ph.D. from UC Berkeley in 2022. His research focuses on machine learning safety, machine ethics, and robustness. In addition to founding CAIS, Hendrycks serves as the safety adviser to xAI, the AI company founded by Elon Musk, and as an adviser to Scale AI. In both advisory roles, he collects a symbolic one-dollar salary and holds no equity, in order to avoid conflicts of interest.
Hendrycks has received wide recognition for his work. He was named to the inaugural TIME100 AI list in 2023, the Vox Future Perfect 50 list in 2024, and the Forbes 30 Under 30 list for AI in 2025. He is also an AI2050 Fellow supported by Schmidt Sciences.
Oliver Zhang co-founded CAIS and serves as its Managing Director. Zhang oversees the operational and strategic direction of the organization.
Josue Estrada serves as Chief Operating Officer. The organization's board includes Jaan Tallinn, the co-founder of Skype and a prominent funder of existential risk research through the Survival and Flourishing Fund and other philanthropic vehicles.
CAIS receives philanthropic support from several major donors in the AI safety and effective altruism space. Open Philanthropy (now known as Coefficient Giving), a grantmaking organization primarily funded by Facebook co-founder Dustin Moskovitz and his wife Cari Tuna, has provided general support grants to CAIS in both 2022 and 2023 through its Navigating Transformative AI fund. Additional funding has come from Schmidt Sciences and other donors focused on reducing catastrophic and existential risks from advanced AI.
On May 30, 2023, CAIS published a single-sentence statement that quickly became one of the most widely discussed public declarations about AI risk:
"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."
The idea for a concise, one-sentence statement was originally proposed by David Scott Krueger, then a professor at the University of Cambridge. The brevity was intentional: the goal was to produce a statement so clear and direct that a wide range of experts could sign it without extensive debate over specific policy positions.
At the time of release, the statement had been signed by more than 350 leading AI researchers and public figures. Signatories included Turing Award laureates Geoffrey Hinton and Yoshua Bengio, two of the most-cited computer scientists in history. The statement was also signed by the leaders of major AI companies, including Sam Altman (CEO of OpenAI), Dario Amodei (CEO of Anthropic), Demis Hassabis (CEO of Google DeepMind), Mustafa Suleyman (then CEO of Inflection AI), Emad Mostaque (then CEO of Stability AI), and Adam D'Angelo (CEO of Quora). Other notable signatories included Bill Gates.
The number of signatories eventually grew to over 600, encompassing AI professors, philosophers, ethicists, legal scholars, economists, physicists, political scientists, pandemic researchers, nuclear scientists, and climate scientists. More than 1,000 researchers from institutions including DeepMind, OpenAI, and Meta AI ultimately endorsed the statement.
The statement received extensive media coverage and helped bring the topic of AI existential risk into mainstream public discourse. It was covered by The New York Times, The Washington Post, CNN, BBC, and numerous other outlets worldwide.
CAIS conducts technical and conceptual research focused exclusively on mitigating societal-scale risks from AI. The organization prioritizes transparency and accessibility, publishing its findings at top machine learning conferences. Key research areas include robustness, monitoring, AI alignment, transparency, and the study of dangerous AI behaviors such as deception, power-seeking, and unethical conduct.
In October 2023, CAIS researchers published "Representation Engineering: A Top-Down Approach to AI Transparency," authored by Andy Zou, Long Phan, Sarah Chen, and collaborators including Dan Hendrycks. Drawing on insights from cognitive neuroscience, the paper introduced representation engineering (RepE), which places population-level representations at the center of analysis rather than individual neurons or circuits. RepE provides methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks, with demonstrated applications to safety-relevant problems including honesty, harmlessness, and power-seeking behaviors.
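The core "reading vector" idea can be illustrated with a minimal sketch: contrast hidden states collected under two conditions (for example, honest versus dishonest instructions), extract a direction that separates them, and project new activations onto it to monitor or steer the concept. The activations, dimensions, and difference-of-means recipe below are simplifying assumptions for illustration, not the paper's exact method.

```python
import numpy as np

# Synthetic placeholder activations; in practice these would be hidden states
# from a chosen transformer layer under contrasting prompts.
rng = np.random.default_rng(0)
hidden_dim = 512
honest_acts = rng.normal(loc=0.3, scale=1.0, size=(100, hidden_dim))
dishonest_acts = rng.normal(loc=-0.3, scale=1.0, size=(100, hidden_dim))

# Difference-of-means direction separating the two conditions.
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Monitoring: project a new activation onto the direction; a higher score
# suggests the representation is closer to the "honest" condition.
new_activation = rng.normal(size=hidden_dim)
score = float(new_activation @ direction)

# Control: nudge the representation along the direction before later layers
# process it (a simple form of representation steering).
steered = new_activation + 2.0 * direction
print(f"monitor score: {score:.3f}")
```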
In 2024, CAIS researchers published work demonstrating how "circuit breakers" can prevent AI models from producing dangerous outputs by interrupting the internal representations responsible for harmful content. Inspired by representation engineering, the approach directly controls the representations that generate harmful outputs and can be applied to both text-only and multimodal language models. In a public jailbreaking competition with substantial prize money, it took approximately twenty thousand attempts to successfully jailbreak a model trained with circuit breakers, demonstrating strong resilience against adversarial attacks. The paper was accepted to NeurIPS 2024.
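A toy sketch of what a rerouting-style training objective of this kind can look like is shown below. The tensors are random placeholders, and the combination of a cosine-similarity "reroute" term on harmful inputs with a retention term on benign inputs is a simplified assumption rather than the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def circuit_breaker_loss(reps_harmful, frozen_harmful, reps_benign, frozen_benign,
                         alpha=1.0, beta=1.0):
    # Reroute: push harmful-input representations away from (toward
    # orthogonality with) what a frozen reference copy of the model encodes.
    reroute = F.relu(F.cosine_similarity(reps_harmful, frozen_harmful, dim=-1)).mean()
    # Retain: keep benign-input representations close to the reference model
    # so ordinary capabilities are preserved.
    retain = F.mse_loss(reps_benign, frozen_benign)
    return alpha * reroute + beta * retain

# Synthetic example: batch of 8 representations with hidden size 64.
h, fh = torch.randn(8, 64, requires_grad=True), torch.randn(8, 64)
b, fb = torch.randn(8, 64, requires_grad=True), torch.randn(8, 64)
loss = circuit_breaker_loss(h, fh, b, fb)
loss.backward()
```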
HarmBench is a standardized evaluation framework for automated red teaming developed by CAIS. The framework addresses the lack of standardized evaluations for red teaming methods by defining three desirable properties for such evaluations: breadth, comparability, and robust metrics. The initial release included evaluations of 33 large language models and 18 red teaming methods. HarmBench was used by the US and UK AI Safety Institutes in their pre-deployment testing of Claude 3.5 Sonnet. The paper was accepted to ICML 2024.
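As a rough illustration, an evaluation loop of the kind HarmBench standardizes might look like the sketch below; `attack`, `target_model`, and `judge` are hypothetical stand-ins for a red-teaming method, the model under test, and a harmfulness classifier, and do not reflect HarmBench's actual interfaces.

```python
def attack_success_rate(behaviors, attack, target_model, judge):
    """Fraction of generated test cases that elicit the target behavior."""
    successes, total = 0, 0
    for behavior in behaviors:
        # The red-teaming method proposes adversarial prompts for this behavior.
        for test_case in attack.generate_test_cases(behavior):
            completion = target_model.generate(test_case)
            # A classifier judges whether the completion exhibits the behavior.
            if judge.is_harmful(behavior, completion):
                successes += 1
            total += 1
    return successes / max(total, 1)
```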
The Weapons of Mass Destruction Proxy (WMDP) Benchmark is a dataset of 4,157 multiple-choice questions created by CAIS to measure hazardous knowledge in large language models across biosecurity, cybersecurity, and chemical security domains. The benchmark was designed in response to the White House Executive Order on Artificial Intelligence, which highlighted the risks of LLMs empowering malicious actors. Importantly, the questions were carefully crafted to exclude genuinely hazardous information, instead using proxy knowledge that correlates with dangerous capabilities. Alongside the benchmark, the researchers introduced CUT (Contrastive Unlearning from Teacher), an unlearning method that removes hazardous knowledge while retaining general model capabilities. The paper appeared at ICML 2024.
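A minimal sketch of how accuracy on a multiple-choice benchmark such as WMDP can be computed is shown below: each answer option is scored under the model and the highest-scoring option is compared with the answer key. The `option_logprob` helper and the example fields are assumptions for illustration, not the benchmark's official evaluation harness.

```python
def multiple_choice_accuracy(examples, option_logprob):
    """examples: dicts with "question", "choices", and an integer "answer" index.
    option_logprob: hypothetical function returning the model's log-probability
    of an answer option given the question."""
    correct = 0
    for ex in examples:
        scores = [option_logprob(ex["question"], choice) for choice in ex["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["answer"])
    return correct / len(examples)
```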
The MACHIAVELLI benchmark, developed in collaboration with UC Berkeley, Carnegie Mellon University, and Yale University, evaluates the ethical behavior of AI agents in text-based interactive environments. Built on 134 Choose-Your-Own-Adventure games containing over 572,000 multi-paragraph scenes, the benchmark provides 13 measures of ethical behavior covering deception, rule-following, power-seeking, and utility. The research demonstrated that while baseline AI agents face trade-offs between ethical behavior and reward maximization, agents can be systematically steered toward safer behavior through policy-shaping and prompt-based approaches. The paper was published as an oral presentation at ICML 2023.
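The trade-off the benchmark measures can be pictured with the simple agent loop below, in which both game reward and annotated ethical violations are tallied; the environment object and its annotation format are hypothetical stand-ins for the actual MACHIAVELLI API.

```python
def play_episode(env, agent, max_steps=100):
    """Run one text-game episode, returning (total reward, violation count)."""
    observation = env.reset()
    total_reward, violations = 0.0, 0
    for _ in range(max_steps):
        action = agent.choose(observation)            # e.g. an LLM picks a choice
        observation, reward, done, info = env.step(action)
        total_reward += reward
        # Each scene/action carries behavioral annotations (deception,
        # power-seeking, harm, ...); count how many were triggered.
        violations += sum(info.get("annotations", {}).values())
        if done:
            break
    return total_reward, violations
```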
CAIS researchers published "An Overview of Catastrophic AI Risks," a comprehensive conceptual paper detailing risk scenarios and mitigation strategies. The paper organizes catastrophic AI risks into four categories:
| Risk Category | Description |
|---|---|
| Malicious Use | Individuals or groups intentionally using AI to cause harm, including autonomous weapons or engineering pandemics |
| AI Race | Competitive pressures compelling actors to deploy unsafe AI systems or cede control to AI |
| Organizational Risks | Human factors and complex systems increasing the probability of catastrophic accidents |
| Rogue AIs | The inherent difficulty of controlling AI systems that may exceed human performance across many tasks |
In 2024, CAIS published research arguing that many existing safety benchmarks primarily correlate with general AI capabilities rather than measuring distinct safety improvements. This critique challenged the practice of "safetywashing," where improvements on general capabilities tests are presented as evidence of improved safety. The paper was accepted to NeurIPS 2024.
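The core diagnostic can be illustrated with a small sketch: derive a general-capabilities factor from many models' scores on capability benchmarks and check how strongly a candidate "safety" benchmark correlates with it. The data below is synthetic and the exact procedure (a first principal component and a Pearson correlation) is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_capability_benchmarks = 30, 5

# Synthetic scores of 30 models on 5 general-capability benchmarks.
capability_scores = rng.normal(size=(n_models, n_capability_benchmarks))

# First principal component of standardized scores ~ a general capabilities
# factor for each model.
standardized = (capability_scores - capability_scores.mean(0)) / capability_scores.std(0)
_, _, vt = np.linalg.svd(standardized, full_matrices=False)
capabilities_factor = standardized @ vt[0]

# A hypothetical "safety" benchmark that is mostly capabilities plus noise.
safety_scores = 0.9 * capabilities_factor + 0.4 * rng.normal(size=n_models)

correlation = np.corrcoef(capabilities_factor, safety_scores)[0, 1]
print(f"capabilities correlation: {correlation:.2f}")  # high value suggests safetywashing
```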
CAIS developed methods for embedding safeguards into open-weight language models that resist removal through fine-tuning. This research addresses the challenge of ensuring safety properties persist even when model weights are publicly available and users can modify them.
Dan Hendrycks, in his individual research career, created several foundational benchmarks that are widely used across the AI industry:
| Benchmark | Year | Description |
|---|---|---|
| ImageNet-C | 2019 | Robustness evaluation benchmark measuring model performance under common image corruptions |
| MMLU | 2020 | 57-subject knowledge evaluation covering history, science, math, law, medicine, and more |
| MATH | 2021 | Mathematical problem-solving benchmark for language models |
| Humanity's Last Exam | 2025 | Expert-level benchmark with 2,500 questions across 100+ academic subjects |
Humanity's Last Exam (HLE) is a flagship benchmark project launched by CAIS in collaboration with Scale AI. The project aimed to create the most challenging AI benchmark ever assembled, crowdsourcing expert-level questions from researchers worldwide. Over 1,200 collaborators contributed, submitting more than 70,000 trial questions. After expert review of 13,000 candidates, the final benchmark consists of 2,500 graduate-level questions spanning over 100 academic subjects in mathematics, humanities, and the natural sciences.
Hendrycks has stated that the project was partly inspired by a conversation with Elon Musk, who believed that existing language model benchmarks were too easy to adequately measure frontier AI capabilities.
When the results were released on January 23, 2025, leading AI models scored in the single digits. GPT-4o achieved only 2.7% accuracy, Claude 3.5 Sonnet reached 4.1%, and OpenAI's o1 model scored 8.0%. The results were covered by The New York Times and The Economist, among other publications.
CAIS operates a compute cluster that provides free GPU access to AI safety researchers worldwide. The cluster was established to address a key barrier in safety research: many researchers working on AI safety problems lacked access to the large-scale compute resources necessary to run experiments on frontier-scale models.
The cluster consists of 10 GPU nodes running on Oracle Cloud bare-metal servers, equipped with 80 NVIDIA A100 GPUs. The infrastructure uses Ansible and Terraform for cluster management, Slurm for job scheduling, and WekaFS for the shared distributed parallel filesystem.
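For illustration, a researcher on a Slurm-managed cluster like this one might request GPUs with a batch script along the following lines; the job name, GPU count, time limit, and file names are hypothetical, not CAIS's actual configuration.

```python
import subprocess
import tempfile

# Hypothetical Slurm batch script requesting GPUs for a safety experiment.
job_script = """#!/bin/bash
#SBATCH --job-name=safety-eval
#SBATCH --gres=gpu:4              # request 4 GPUs on one node
#SBATCH --time=12:00:00
#SBATCH --output=slurm-%j.out

python run_experiment.py --config eval.yaml
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# Hand the script to the Slurm scheduler; it prints e.g. "Submitted batch job 12345".
subprocess.run(["sbatch", script_path], check=True)
```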
As of the end of 2024, the compute cluster has supported approximately 350 researchers over its lifetime, a figure that grew roughly sixfold in 18 months. The cluster has enabled the production of 109 cumulative research papers with over 4,000 citations. In 2024 alone, researchers using the cluster published 77 new papers.
CAIS invests heavily in growing the AI safety research community through educational programs, competitions, fellowships, and community events.
In December 2024, Dan Hendrycks published the textbook Introduction to AI Safety, Ethics, and Society through Taylor & Francis. The book provides a comprehensive and accessible exploration of AI risks, technical and organizational tools for making AI systems safer, and the ethical and societal challenges involved in ensuring beneficial outcomes from AI. An accompanying free online course was launched alongside the textbook, with 240 participants enrolled in its initial offering. An earlier version of the courseware, titled "Intro to ML Safety," had been available as a free online course covering robustness, monitoring, control, and systemic safety.
The CAIS Philosophy Fellowship is a seven-month research program that brings philosophers to San Francisco to study the societal implications and potential risks of advanced AI. The 2023 cohort attended guest lectures from prominent philosophers, including Peter Railton, Shelly Kagan, and Nick Bostrom, as well as from AI professors and researchers at Berkeley, Cambridge, and DeepMind. The fellows collectively produced eighteen original research papers covering topics such as interpretability, corrigibility, and multipolar scenarios, and the program also resulted in a special issue of Philosophical Studies.
SafeBench is a competition organized by CAIS, sponsored by Schmidt Sciences, offering $250,000 in prizes to stimulate the development of new benchmarks for assessing and reducing AI risks. The competition received nearly 100 submissions across four categories: Robustness, Monitoring, Alignment, and Safety Applications. Eight winning submissions were selected, with three receiving a first prize of $50,000 each and five receiving $20,000 each. Notable winning benchmarks included:
| Benchmark | Description |
|---|---|
| Cybench | 40 professional-level Capture the Flag tasks across six cybersecurity categories |
| SAD (Situational Awareness) | Over 13,000 questions testing LLM situational awareness across 7 task categories |
| BioLP-bench | Evaluates language models' ability to find and correct mistakes in biological laboratory protocols |
The Trojan Detection Challenge 2023 (LLM Edition) was a NeurIPS 2023 competition organized by CAIS to advance methods for detecting hidden functionality in large language models. The competition featured two tracks: a Trojan Detection Track, where participants searched for trigger phrases of hundreds of trojans inserted into LLMs, and a Red Teaming Track, where participants generated prompts to elicit harmful behaviors from Llama 2 chat models. The competition offered $30,000 in prizes and attracted over 125 teams who submitted more than 3,400 entries. Results were presented at a workshop at NeurIPS 2023.
CAIS regularly hosts workshops, socials, and events at major AI conferences. At NeurIPS 2024, the organization ran a workshop that received 51 submissions, of which 34 were accepted. CAIS also organized social events at ICML and ICLR with over 200 attendees. The organization publishes the AI Safety Newsletter, which grew to over 24,000 subscribers by the end of 2024, tripling its subscriber base during that year.
Across all of its programs, CAIS has trained approximately 500 students in AI safety concepts and methods. Over 500 researchers have participated in CAIS-organized events, and the Trojan Detection Challenge alone drew over 1,200 competition submissions from more than 70 teams.
In July 2023, CAIS established the Center for AI Safety Action Fund (CAIS AF), a 501(c)(4) nonpartisan advocacy organization dedicated to advancing public policies that maintain U.S. leadership in AI while protecting against AI-related national security threats. The Action Fund brings together lawmakers, business leaders, national security experts, nongovernmental organizations, and machine learning engineers to raise awareness of AI risks and promote legislation.
Key policy focus areas for the CAIS Action Fund include chip security, location verification for AI hardware, and supporting the Bureau of Industry and Security (BIS) in its capacity to regulate AI-related exports. In 2024, the CAIS Action Fund spent $270,000 on lobbying activities.
The CAIS Action Fund organized and co-led a joint letter signed by more than 80 technology organizations asking Congress to fully fund the National Institute of Standards and Technology's (NIST) work on AI. Through bipartisan congressional engagement, the organization successfully advocated for $10 million in funding for the U.S. AI Safety Institute.
CAIS co-sponsored California Senate Bill 1047, the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, which proposed safety requirements for developers of frontier AI models. The bill gathered support from over 70 academic researchers, more than 4,000 supporters who called the governor's office, and over 7,000 petition signers. The bill was ultimately vetoed by Governor Gavin Newsom in September 2024.
CAIS occupies a distinct position in the AI safety ecosystem. Unlike organizations such as the Machine Intelligence Research Institute (MIRI), which focuses primarily on theoretical alignment research, or the Future of Life Institute (FLI), which emphasizes policy and outreach, CAIS combines empirical technical research with field-building and public advocacy. The organization has collaborated with academic institutions including UC Berkeley, Carnegie Mellon University, and Yale University on research projects.
CAIS has worked with government entities including the U.S. and U.K. AI Safety Institutes, which used the HarmBench framework for pre-deployment safety evaluations. Through Dan Hendrycks's advisory roles, CAIS also maintains connections to industry, including xAI and Scale AI.
The organization is part of a broader network of AI safety organizations that includes Anthropic, the Alignment Research Center (ARC), and various academic labs. CAIS's board member Jaan Tallinn also serves on the boards of the Future of Life Institute and the Centre for the Study of Existential Risk at the University of Cambridge.
By the end of 2024, CAIS had produced or supported over 170 AI safety research papers across its various programs. The organization's compute cluster alone contributed to 109 papers with over 4,000 citations. CAIS's benchmarks, including HarmBench, WMDP, and MACHIAVELLI, have been adopted by AI laboratories and government safety institutes for evaluating model safety.
The Statement on AI Risk remains one of the most widely cited public declarations on AI safety, and it played a significant role in shifting public perception of AI risks from a fringe concern to a mainstream policy issue. Dan Hendrycks's inclusion on the TIME100 AI list in 2023 further elevated the organization's profile.
The following table lists selected major publications by CAIS researchers:
| Publication | Year | Venue | Topic |
|---|---|---|---|
| Representation Engineering: A Top-Down Approach to AI Transparency | 2023 | Preprint | AI transparency and interpretability |
| Do the Rewards Justify the Means? (MACHIAVELLI) | 2023 | ICML 2023 (Oral) | Ethical behavior of AI agents |
| An Overview of Catastrophic AI Risks | 2023 | Preprint | Risk taxonomy and mitigation |
| HarmBench: A Standardized Evaluation Framework for Automated Red Teaming | 2024 | ICML 2024 | Red teaming evaluation |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | 2024 | ICML 2024 | Hazardous knowledge assessment |
| Improving Alignment and Robustness with Circuit Breakers | 2024 | NeurIPS 2024 | Jailbreak resistance |
| Safetywashing: Do AI Safety Benchmarks Actually Measure Safety? | 2024 | NeurIPS 2024 | Benchmark critique |
| Humanity's Last Exam | 2025 | Preprint | Expert-level AI benchmark |
| Introduction to AI Safety, Ethics, and Society (Textbook) | 2024 | Taylor & Francis | AI safety education |