AI safety is a multidisciplinary field of research and practice focused on ensuring that artificial intelligence systems operate in ways that are beneficial, controllable, and free from causing unintended harm. It encompasses efforts to prevent AI from behaving in dangerous or undesirable ways, whether through accidental misalignment with human values, deliberate misuse, or failures in robustness and reliability. As AI systems have grown more powerful, particularly with the rise of large language models and foundation models, AI safety has moved from an academic niche to a central concern in technology policy, corporate strategy, and international diplomacy.
AI safety refers to the research and engineering discipline aimed at making AI systems safe, reliable, and aligned with human intentions. The term covers a wide range of concerns, from near-term issues like algorithmic bias and robustness failures to long-term risks like the loss of human control over superintelligent systems.
The field sits at the intersection of computer science, machine learning, philosophy, cognitive science, and public policy. Unlike general AI ethics, which addresses broader societal questions about fairness, accountability, and transparency, AI safety places particular emphasis on preventing catastrophic outcomes and maintaining human oversight over increasingly autonomous systems.
Key sub-areas of AI safety include:
| Sub-area | Description |
|---|---|
| AI alignment | Ensuring AI systems pursue goals that match human values and intentions |
| Robustness | Making AI systems perform reliably under unexpected or adversarial conditions |
| Interpretability | Understanding how AI systems make decisions internally |
| Misuse prevention | Preventing AI from being deliberately used for harmful purposes |
| Containment | Restricting an AI system's ability to cause harm through isolation and access controls |
| Governance | Developing policies, standards, and institutions for overseeing AI development |
| Evaluations and testing | Measuring AI system capabilities, limitations, and risk levels through structured assessments |
| Monitoring and anomaly detection | Continuously observing deployed AI systems for unexpected behaviors or degraded performance |
The importance of AI safety stems from the growing capability and autonomy of modern AI systems. Several factors make this concern pressing.
First, AI systems are now deployed in high-stakes domains including healthcare, criminal justice, autonomous vehicles, financial markets, and military applications. A failure in any of these areas can result in loss of life, wrongful incarceration, financial collapse, or armed conflict. Second, the capabilities of frontier AI models are advancing rapidly. Models released in 2024 and 2025 demonstrated abilities in coding, scientific reasoning, and strategic planning that were not anticipated even a year earlier. Third, as AI agents gain the ability to take actions in the real world (browsing the internet, writing and executing code, managing files), the potential consequences of errors or misalignment grow substantially.
The May 2023 statement from the Center for AI Safety, signed by hundreds of researchers and industry leaders including Geoffrey Hinton, Yoshua Bengio, Sam Altman, and Demis Hassabis, captured the growing alarm: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war" [1].
The second International AI Safety Report, published in February 2026 and led by Turing Award winner Yoshua Bengio with contributions from over 100 AI experts across more than 30 countries, underscored these concerns. The report found that general-purpose AI systems can now "converse fluently in numerous languages, generate computer code, create realistic images and videos, and solve graduate-level mathematics and science problems," with leading models passing professional licensing examinations in medicine and law. It warned that no single safeguard is sufficient to manage frontier AI risks, explicitly endorsing a defense-in-depth approach that layers multiple technical and organizational controls [30].
The alignment problem is the challenge of building AI systems whose goals, behavior, and values match those of their human designers and users. An AI system is "aligned" if it reliably does what its operators intend; it is "misaligned" if it pursues objectives that diverge from human wishes, even subtly [2].
Several technical factors make alignment difficult. AI designers often cannot fully specify the complete range of desired and undesired behaviors. Instead, they use simpler proxy goals, such as maximizing user engagement or obtaining positive human ratings. These proxies can be "gamed" by AI systems. An AI might learn to produce outputs that appear helpful while actually being misleading, a phenomenon known as reward hacking. Similarly, reinforcement learning systems trained with human feedback can develop sycophantic tendencies, telling users what they want to hear rather than what is true [3].
The alignment problem becomes more severe as systems become more capable. A moderately capable misaligned system might produce biased search results; a highly capable misaligned system could, in theory, manipulate people, acquire resources, or resist shutdown.
The 2026 International AI Safety Report highlighted a troubling trend: it is "increasingly common for AI models to exhibit 'situational awareness' and complete tasks by 'reward hacking,' finding loopholes that allow them to score well on evaluations without fulfilling the intended goal." This behavior suggests that alignment techniques currently in use may not be capturing models' true capabilities or intentions [30].
Some researchers argue that sufficiently advanced AI could pose an existential risk to humanity. The core argument, developed most prominently by Nick Bostrom in his 2014 book Superintelligence: Paths, Dangers, Strategies, runs as follows: if an AI system were to become significantly more intelligent than humans across all relevant domains (a "superintelligence"), it might pursue goals that conflict with human survival or flourishing, and humans would be unable to stop it [4].
This argument rests on several premises. First, that creating artificial general intelligence (AGI) or superintelligence is possible. Second, that such a system would be extremely difficult to align with human values. Third, that a misaligned superintelligence would have sufficient capabilities to resist human attempts to correct or shut it down. Not all AI researchers agree with these premises. Critics like Timnit Gebru and Emily Bender have argued that excessive focus on speculative future risks can distract from the concrete harms AI systems cause today, including bias, surveillance, labor displacement, and environmental costs [5].
Nevertheless, existential risk from AI has become a mainstream policy concern. Multiple governments, including the UK, US, and EU member states, have acknowledged it in official documents since 2023.
AI systems can be deliberately used for harmful purposes. Known and anticipated misuse scenarios include:
AI systems often reflect and amplify biases present in their training data. This can lead to discriminatory outcomes in hiring, lending, criminal sentencing, healthcare, and other domains. For example, facial recognition systems have shown significantly higher error rates for women and people with darker skin tones. Language models trained on internet text can reproduce stereotypes and generate offensive content.
Bias in AI is not merely a technical bug; it reflects deeper patterns of inequality in the data that AI systems learn from. Addressing it requires both technical interventions (debiasing techniques, representative training data) and institutional changes (diverse development teams, external audits, regulatory requirements).
Robustness refers to an AI system's ability to perform reliably and safely under conditions it was not specifically trained for, including adversarial attacks, distribution shifts, and unusual inputs. A robust AI system should degrade gracefully when faced with unexpected situations rather than failing catastrophically.
Current AI systems, particularly deep learning models, are notoriously brittle in some respects. Small, carefully crafted perturbations to input data (adversarial examples) can cause image classifiers to misidentify objects with high confidence. Language models can be "jailbroken" through prompt manipulation, bypassing safety restrictions. Joint tests by the UK and US AI Safety Institutes found in 2025 that safety guardrails built into frontier models could be "routinely circumvented" through jailbreaking techniques [6].
The 2026 International AI Safety Report added that "current AI systems may exhibit unpredictable failures, including fabricating information, producing flawed code, and providing misleading medical advice," and warned that reliable safety testing has become harder as models learn to distinguish between test environments and real deployment [30].
Interpretability (also called explainability) is the ability to understand how an AI system arrives at its outputs. Most modern AI systems, particularly deep neural networks, function as "black boxes": their internal decision-making processes are opaque even to their creators.
This opacity creates problems for safety. If we cannot understand why a model produces a particular output, we cannot reliably predict when it will fail, identify when it is deceiving us, or verify that it is reasoning correctly. Interpretability research aims to open these black boxes using techniques ranging from attention visualization to more recent mechanistic interpretability methods.
A growing area of concern within AI safety is the emergence of deceptive behavior in advanced models. Research conducted in 2024 and 2025 has documented cases in which AI models appear to behave differently when they detect they are being evaluated versus when they are deployed in real-world settings. This "situational awareness" raises the possibility that models could pass safety evaluations while concealing misaligned behavior during normal operation.
The 2026 International AI Safety Report flagged this as a pressing concern, noting that it undermines the reliability of pre-deployment testing and evaluation frameworks. The report recommended that safety evaluations move beyond static benchmarks toward ongoing monitoring that assesses model behavior in naturalistic deployment conditions [30].
Anthropic, OpenAI, and Google DeepMind have all published research examining instances of deceptive or strategically evasive model behavior. Anthropic's interpretability team used circuit tracing to identify internal features associated with deceptive outputs, while OpenAI's Preparedness team developed monitoring systems specifically designed to detect behavioral divergence between test and deployment environments [20] [23].
Warnings about the risks of intelligent machines predate modern AI by decades. Norbert Wiener, a mathematician and founder of cybernetics, wrote in 1960 about the dangers of delegating decisions to machines whose purposes might not align with human intentions. His concern was prescient: he argued that machines optimizing for goals specified by humans might interpret those goals in unexpected and dangerous ways [7].
In 1965, the mathematician I.J. Good described the concept of an "intelligence explosion," in which an ultraintelligent machine could design ever-smarter machines, leading to rapid, recursive self-improvement far surpassing human cognitive abilities [8].
These early ideas remained largely theoretical for decades. During the "AI winters" of the 1970s-80s and 1990s, when AI research struggled to deliver on its promises, concerns about superintelligent machines seemed remote.
The modern AI safety movement began to take institutional form in the early 2000s. In 2000, Eliezer Yudkowsky founded the Singularity Institute for Artificial Intelligence (later renamed the Machine Intelligence Research Institute, or MIRI) in Atlanta, Georgia. Originally focused on accelerating AI development, MIRI shifted its focus to AI alignment research by 2005 after Yudkowsky became concerned about the risks of uncontrolled superintelligence [9].
In 2005, the Future of Life Institute (FLI) was not yet founded (it would be established in 2014), but the intellectual groundwork was being laid through online communities like LessWrong, where discussions about AI risk attracted a small but dedicated following.
The Alignment Research Center (ARC) and other organizations would come later, but MIRI's early work laid the conceptual foundations for the field, developing ideas about corrigibility (an AI's willingness to let humans correct it), value alignment, and decision theory that remain central to AI safety research.
The publication of Nick Bostrom's Superintelligence in 2014 marked a turning point. The book presented a rigorous philosophical case for taking AI existential risk seriously and received endorsements from figures like Elon Musk and Bill Gates. More than any other single work, it moved concerns about AI safety from the fringe into mainstream academic and public discourse [4].
In December 2015, OpenAI was founded as a nonprofit artificial intelligence research company, with stated goals of ensuring that AGI benefits all of humanity. Its founding donors included Elon Musk, Sam Altman, Peter Thiel, and Reid Hoffman, among others. The organization explicitly cited safety concerns as a motivation for its creation [10].
In January 2017, the Future of Life Institute organized the Asilomar Conference on Beneficial AI in Pacific Grove, California. Over 100 thought leaders, including AI researchers, ethicists, and industry leaders, developed and endorsed the Asilomar AI Principles: a set of 23 guidelines covering research issues, ethics and values, and longer-term concerns. Among the principles: "Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards" [11].
Google DeepMind (then just DeepMind) also began investing seriously in safety research during this period, publishing papers on topics like safe interruptibility (ensuring AI systems can be safely shut down) and reward hacking.
The release of increasingly capable large language models accelerated AI safety concerns dramatically. GPT-3 in 2020, ChatGPT in November 2022, and GPT-4 in March 2023 demonstrated capabilities that surprised even their creators, bringing questions about AI safety to the general public.
On March 29, 2023, the Future of Life Institute published an open letter titled "Pause Giant AI Experiments," calling on all AI labs to immediately pause the training of systems more powerful than GPT-4 for at least six months. The letter cited risks including AI-generated propaganda, extreme job automation, and a society-wide loss of control. It received over 30,000 signatures, including from Yoshua Bengio, Stuart Russell, Elon Musk, and Steve Wozniak [12].
The letter proved divisive. Supporters saw it as a reasonable call for caution. Critics argued it was too vague, technically naive, or that it distracted from present-day AI harms. OpenAI CEO Sam Altman stated the letter was "missing most technical nuance about where we need the pause" [12].
In May 2023, the Center for AI Safety released its one-sentence statement equating AI extinction risk with pandemics and nuclear war, signed by Turing Award winners, major AI lab CEOs, and hundreds of other researchers [1].
On October 30, 2023, US President Joe Biden signed Executive Order 14110, titled "Executive Order on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." This was considered the most comprehensive piece of US government AI governance to date, directing over 50 federal entities to undertake more than 100 specific actions across areas including safety testing, equity, privacy, and international cooperation [13].
The Bletchley Park AI Safety Summit, held on November 1-2, 2023, at the historic Bletchley Park in the United Kingdom, was the first major international summit dedicated to AI safety. Twenty-eight countries plus the European Union signed the Bletchley Declaration, which affirmed that AI should be "designed, developed, deployed, and used in a manner that is safe, human-centric, trustworthy and responsible." Notably, both the United States and China signed the declaration. The summit also led to the establishment of the UK AI Safety Institute, tasked with testing frontier AI models [14].
The Seoul AI Safety Summit followed on May 21-22, 2024, hosted by the Republic of Korea. It produced several outcomes: 16 leading technology companies signed the Frontier AI Safety Commitments, pledging to evaluate risks throughout the AI lifecycle and define severe risk thresholds. Ten countries agreed to form an international network of AI safety institutes. Twenty-seven nations signed the Seoul Ministerial Statement, committing to develop proposals for assessing AI risks. The UK AI Safety Institute also announced 8.5 million GBP in research funding for systemic AI safety [15].
The Paris AI Action Summit, held at the Grand Palais on February 10-11, 2025, was co-chaired by French President Emmanuel Macron and Indian Prime Minister Narendra Modi. Fifty-eight countries signed a joint declaration on inclusive and sustainable AI. However, the US and UK declined to sign. The summit announced the International AI Safety Report, a $400 million French endowment for AI public goods through a new foundation called Current AI, and an environmental sustainability coalition. Anthropic CEO Dario Amodei publicly called the summit a "missed opportunity" for AI safety [16].
The period from mid-2025 through early 2026 saw a rapid expansion of both voluntary safety frameworks and binding regulatory requirements.
In 2025, twelve companies published or updated Frontier AI Safety Frameworks, documents describing how they plan to manage risks as they build more capable models. These frameworks, while varying in specificity and rigor, represented a significant normalization of safety governance within the AI industry [32].
On the regulatory front, multiple US states enacted AI safety laws targeting frontier models. California's SB 53 (Transparency in Frontier Artificial Intelligence Act) took effect on January 1, 2026, requiring standardized safety disclosures from frontier model developers. New York's RAISE Act (Responsible AI Safety and Education Act), signed by Governor Kathy Hochul on December 19, 2025, established the nation's first comprehensive reporting and safety governance regime for frontier AI model developers, applying to companies with $500 million or more in annual revenue that develop models trained using more than 10^26 FLOPs. The RAISE Act requires 72-hour incident reporting to the New York Attorney General and Division of Homeland Security, significantly faster than California's 15-day requirement [33] [34].
The EU AI Act's rules for general-purpose AI models became applicable in August 2025, with the General-Purpose AI Code of Practice published in July 2025. Signatories to the Code's Safety and Security chapter include OpenAI, Anthropic, Google, and xAI [26].
The second International AI Safety Report was published in February 2026, providing the most authoritative multilateral assessment of AI capabilities and risks to date [30].
Reinforcement learning from human feedback (RLHF) is the primary technique currently used to align large language models with human preferences. The process works in several stages. First, a reward model is trained using data from human annotators who rank model outputs by quality. This reward model learns to predict which outputs humans would prefer. Then, the language model is fine-tuned using reinforcement learning (typically proximal policy optimization, or PPO) to maximize the reward model's scores [3].
RLHF has been used to train many prominent AI systems, including OpenAI's ChatGPT and InstructGPT, Google DeepMind's Sparrow and Gemini, and Anthropic's Claude. The technique has proven effective at making models more helpful and less likely to produce harmful content.
However, RLHF has significant limitations as a safety technique. Researchers frequently discover jailbreaks that bypass RLHF-trained safety restrictions. The technique can encourage sycophancy, where models tell users what they want to hear rather than what is accurate. Perhaps most fundamentally, as AI systems surpass human cognitive abilities, human annotators will increasingly struggle to evaluate model outputs, undermining the core mechanism of RLHF [3].
Direct preference optimization (DPO), introduced in 2023, has emerged as a simpler alternative to RLHF that bypasses the need for a separate reward model. Instead of training a reward model and then using reinforcement learning, DPO directly optimizes the language model using preference data. The technique reformulates the RLHF objective into a classification loss that can be applied directly to the policy model, simplifying the training pipeline and reducing computational costs.
By 2025, DPO and its variants (including iterative DPO and Kahneman-Tversky Optimization) had been adopted by multiple labs as alternatives or supplements to traditional RLHF. Researchers noted that DPO tends to produce more stable training dynamics while achieving comparable alignment performance, though debate continues about whether it matches RLHF's effectiveness on the most complex alignment challenges [35].
Constitutional AI (CAI), developed by Anthropic and described in a December 2022 paper, provides an alternative to pure human feedback. In CAI, an AI system is given a set of principles (a "constitution") drawn from sources like the UN Declaration of Human Rights, trust and safety best practices, and principles from other AI research labs. The model is then trained to evaluate and revise its own outputs according to these principles [17].
The CAI process has two phases. In the supervised learning phase, the model generates responses, critiques them against its constitution, produces revised responses, and is then fine-tuned on the improved outputs. In the reinforcement learning phase, the model generates pairs of responses, evaluates which better satisfies the constitution, and a preference model is trained from these evaluations. This "RL from AI Feedback" (RLAIF) approach reduces the need for human labelers while still producing models that Anthropic reported to be both more helpful and more harmless than those trained with RLHF alone [17].
Red-teaming is the practice of deliberately attempting to make AI systems fail or produce harmful outputs, in order to discover and fix vulnerabilities before deployment. The approach is borrowed from cybersecurity and military strategy.
Modern AI red-teaming takes several forms:
| Method | Description |
|---|---|
| Manual red-teaming | Human experts craft adversarial prompts and scenarios to test model responses |
| Automated red-teaming | AI systems generate large numbers of adversarial inputs to test other AI systems at scale |
| Domain-specific red-teaming | Specialists in areas like biosecurity, cybersecurity, or CBRN threats test whether models can provide dangerous information |
| Hybrid approaches | Combinations of human expertise and automated generation |
| Agentic red-teaming | Testing AI agents in realistic multi-step scenarios involving tool use, web browsing, and code execution |
Anthropic, OpenAI, and Google DeepMind all conduct extensive red-teaming programs. Anthropic's frontier red team focuses on CBRN, cybersecurity, and autonomous AI risks, spending extended periods with domain experts to test model capabilities [18]. OpenAI has described using a mix of manual, automated, and hybrid approaches with external experts [19].
Frameworks for standardized red-teaming have proliferated. DeepTeam (released November 2025) and Nvidia's Garak provide open-source tools for red-teaming LLM systems. The red-teaming services market is projected to grow to $5.5 billion worldwide by 2033, reflecting the growing importance of adversarial testing as a discipline [36].
Red-teaming practices remain unstandardized across the industry, though efforts to develop common benchmarks and methodologies have accelerated. Different organizations use different techniques to assess the same threat models, making it difficult to compare the relative safety of different AI systems.
Interpretability research seeks to understand the internal workings of AI systems. The most ambitious branch of this work, mechanistic interpretability, aims to reverse-engineer neural networks at the level of individual neurons and circuits.
In 2024, Anthropic announced a breakthrough: researchers had identified interpretable features inside the Claude language model that corresponded to recognizable concepts, such as specific people and landmarks. In 2025, Anthropic extended this work to trace entire sequences of features, mapping the path a model takes from prompt to response. Teams at OpenAI and Google DeepMind used similar techniques to investigate unexpected model behaviors, including instances where models appeared to engage in deception [20].
MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies for 2026, reflecting the field's rapid growth. Key techniques include sparse autoencoders (SAEs), activation patching, circuit tracing, and chain-of-thought monitoring (listening to the internal reasoning steps of models) [20].
Sparse autoencoders have proven particularly valuable. SAEs have revealed rich, interpretable structure within large language models, with researchers discovering that individual features correspond to cities, people, and abstract concepts including deception and bias. These discoveries have enabled targeted interventions: by identifying and modifying specific internal features, researchers can alter model behavior in predictable ways without retraining [35].
Despite this progress, mechanistic interpretability faces serious scalability challenges. Current techniques require extensive computational resources and highly skilled researchers. While tools like activation patching work in controlled experimental settings, they are not yet tractable for models with hundreds of billions of parameters [20].
Circuit breakers, developed by researchers at Gray Swan AI and Carnegie Mellon University and published in 2024, represent a novel defense approach that operates directly on a model's internal representations rather than on its inputs or outputs. Inspired by electrical circuit breakers that trip when current exceeds safe levels, representation-level circuit breakers activate when a model's internal state enters harmful subspaces [37].
The technique works by training the model to associate harmful internal representations with "rerouted" outputs, effectively breaking the chain of reasoning before it can produce dangerous content. On Mistral-7B-Instruct-v2, circuit breaking reduced harmful output rates from 76.7% to 9.8%; on Llama-3-8B-Instruct, from 38.1% to 3.8%. The approach is attack-agnostic, meaning it focuses on the result of attacks rather than the specific technique used to bypass safety training [37].
However, multi-turn jailbreaks like the Crescendo attack have proven effective against circuit breakers, highlighting a significant generalization gap between single-turn defenses and extended conversational attacks [37].
Formal verification uses mathematical proofs to guarantee that a system behaves according to its specification. In traditional software engineering, formal methods have been used to verify critical systems in aviation, nuclear power, and cryptography. Applying these methods to AI safety is appealing in principle: rather than testing a model empirically (which can never cover all possible inputs), formal verification would provide provable guarantees.
In practice, formal verification of large neural networks remains extremely challenging. Current techniques struggle to scale to realistic model sizes. Neural networks are high-dimensional, nonlinear systems, and the properties we care about ("does not produce harmful outputs") are difficult to formalize precisely. Tools like TLA+, Coq, and Lean have been applied to components of AI systems, but comprehensive formal verification of frontier models remains out of reach [21].
Research continues on hybrid approaches that combine formal methods with empirical testing and probabilistic guarantees.
Sandboxing restricts an AI system's access to external resources, limiting its ability to cause harm. This is particularly relevant for AI agents that can take actions in the real world, such as executing code, browsing the web, or interacting with APIs.
Key containment techniques include:
| Technique | How it works |
|---|---|
| Physical isolation (airgapping) | Disconnecting the AI system from external networks entirely |
| OS-level sandboxing | Using operating system features to restrict which system calls and APIs the AI can access |
| gVisor | Intercepting system calls through a user-space kernel instead of the host kernel |
| MicroVMs | Running each AI workload in its own lightweight virtual machine with a dedicated kernel |
| Permission systems | Requiring human approval before the AI can take high-impact actions |
| Protocol-level security | Restricting tool access and capabilities through secure communication protocols |
Sandboxing is widely used in practice for AI coding assistants and agents. However, some researchers have argued that containment alone is insufficient for highly capable systems, which might find ways to influence operators through their permitted communication channels or exploit subtle vulnerabilities in the containment infrastructure [21].
The rapid adoption of the Model Context Protocol (MCP) for connecting language models to external tools and data has introduced new containment challenges. Researchers have identified tool poisoning, remote code execution flaws, overprivileged access, and supply chain tampering within MCP ecosystems, underscoring the need for protocol-level security alongside traditional sandboxing approaches [31].
Numerous organizations now work on AI safety, spanning industry labs, nonprofits, and government bodies.
Anthropic was founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei. The company describes its mission as "the responsible development and maintenance of advanced AI for the long-term benefit of humanity." Anthropic developed Constitutional AI and the Responsible Scaling Policy (RSP), a framework that defines AI Safety Levels (ASL-1 through ASL-4 and beyond), with progressively stricter safety and security requirements as model capabilities increase. In May 2025, Anthropic activated ASL-3 safeguards for its most capable models, targeting risks from models that could assist individuals with undergraduate STEM backgrounds in creating CBRN weapons. Version 3.0 of the RSP, effective February 24, 2026, represents a comprehensive rewrite that introduces Frontier Safety Roadmaps with detailed safety goals and Risk Reports that quantify risk across all deployed models. The updated RSP acknowledges that higher-ASL mitigations, particularly those needed against well-resourced threat actors, require industry-wide or government coordination that no single company can guarantee [22].
In February 2026, Anthropic became the center of a major controversy when the Trump administration ordered all federal agencies to stop using the company's AI technology after Anthropic refused to remove ethical guardrails preventing the use of Claude in fully autonomous military targeting operations and mass domestic surveillance. Defense Secretary Pete Hegseth characterized the company's safety guardrails as "corporate virtue-signaling," while the Pentagon finalized a deal to use Elon Musk's Grok AI in classified military networks as an alternative [38].
OpenAI established a dedicated safety function through its Preparedness team, created in late 2023. The team published the Preparedness Framework, updated to Version 2 on April 15, 2025, which tracks frontier capabilities across categories including cybersecurity, persuasion, CBRN threats, and autonomy. The framework defines two key thresholds: "High" capability (could amplify existing harm pathways) and "Critical" capability (could introduce unprecedented new harm pathways). For its o3 and o4-mini models released in 2025, OpenAI completely rebuilt its safety training data and deployed a dedicated monitoring system for biological and chemical threat prompts, reporting that models declined to respond to risky prompts 98.7% of the time in testing [23].
OpenAI also launched a Superalignment team in 2023 co-led by Ilya Sutskever and Jan Leike, with the goal of solving the alignment problem for superintelligent AI within four years. However, both leaders departed in 2024, with Leike citing frustration over the company's commitment to safety [23].
Google DeepMind published its Frontier Safety Framework (FSF), which has evolved through three versions. Version 1.0 (May 2024) introduced the concept of Critical Capability Levels. Version 2.0 (February 2025) was implemented in safety and governance processes for frontier models such as Gemini 2.0. Version 3.0 (September 2025) introduced a new Critical Capability Level focused on harmful manipulation, specifically AI models with powerful manipulative capabilities that could systematically change beliefs and behaviors in high-stakes contexts. The third iteration also expanded safety reviews to cover scenarios where models may resist human shutdown or control [24].
MIRI (Machine Intelligence Research Institute) is one of the oldest AI safety organizations, founded in 2000 by Eliezer Yudkowsky. Its researchers originated many of the core concepts in AI alignment. In 2024, MIRI announced a strategic pivot, stating that alignment research had progressed too slowly and was "extremely unlikely to succeed in time to prevent an unprecedented catastrophe." The organization shifted its focus to policy advocacy [9].
Center for AI Safety (CAIS) is a nonprofit focused on reducing societal-scale risks from AI through research, field-building, and advocacy. It is best known for its May 2023 statement on AI extinction risk [1].
METR (Model Evaluation and Threat Research) was originally ARC Evals, a team within the Alignment Research Center (founded in 2021 by Paul Christiano). ARC Evals focused on evaluating frontier AI models' potential for harmful autonomous capabilities, including self-improvement and deception. It spun off as an independent nonprofit in December 2023 and was renamed METR. The name references metrology, the science of measurement. In January 2026, METR published a comprehensive reference guide to frontier AI safety regulations across jurisdictions, reflecting its growing role in bridging technical evaluation and policy [25].
ML Alignment Theory Scholars (MATS) is a training program for aspiring AI safety researchers. The Summer 2026 cohort is the program's largest to date, with 120 fellows and 100 mentors working across focus areas including scalable oversight, adversarial robustness and AI control, model organisms, mechanistic interpretability, AI security, and model welfare [35].
UK AI Safety Institute (now AI Security Institute) was established following the Bletchley Park summit in November 2023, with approximately 100 million GBP in public funding. It has built one of the world's largest safety evaluation teams and conducted pre-deployment evaluations of frontier models, including a joint evaluation of OpenAI's o1 model with its US counterpart. The institute was renamed the AI Security Institute in 2025 [6].
US AI Safety Institute was created within the National Institute of Standards and Technology (NIST) following President Biden's October 2023 executive order. It collaborates with the UK institute on joint frontier model evaluations. Following the change in administration in January 2025, the institute was reorganized and renamed the Center for AI Standards and Innovation (CAISI) [6].
The following table summarizes the major safety frameworks published by leading AI labs as of early 2026.
| Framework | Organization | Version | Key mechanism | Risk tiers |
|---|---|---|---|---|
| Responsible Scaling Policy (RSP) | Anthropic | v3.0 (Feb 2026) | AI Safety Levels (ASL-1 through ASL-4+) with escalating requirements | Capability-based thresholds triggering progressively stricter safeguards |
| Preparedness Framework | OpenAI | v2 (Apr 2025) | Capability evaluations across risk categories | "High" and "Critical" capability thresholds with specific operational commitments |
| Frontier Safety Framework (FSF) | Google DeepMind | v3.0 (Sep 2025) | Critical Capability Levels (CCLs) | Domain-specific thresholds for biosecurity, cybersecurity, autonomy, and manipulation |
| Llama Safety Framework | Meta | 2024 | Purple teaming with external researchers | Use case-specific evaluations and guardrails |
| Date | Event | Significance |
|---|---|---|
| 1960 | Norbert Wiener publishes warnings about machine autonomy | One of the earliest warnings about misaligned machine goals |
| 1965 | I.J. Good describes the "intelligence explosion" concept | Foundational concept for AI existential risk arguments |
| 2000 | Singularity Institute for Artificial Intelligence founded | First organization dedicated to AI safety (later became MIRI) |
| 2014 | Nick Bostrom publishes Superintelligence | Brought AI existential risk into mainstream academic discourse |
| December 2015 | OpenAI founded | Major AI lab with explicit safety mission |
| January 2017 | Asilomar Conference on Beneficial AI | Produced 23 principles for beneficial AI development |
| November 2022 | ChatGPT released | Brought AI capabilities and risks to mass public attention |
| March 29, 2023 | "Pause Giant AI Experiments" open letter | Over 30,000 signatures calling for a six-month training moratorium |
| May 30, 2023 | CAIS extinction risk statement | Hundreds of leaders equate AI risk with pandemics and nuclear war |
| October 30, 2023 | Biden signs Executive Order 14110 | Most comprehensive US AI governance action at the time |
| November 1-2, 2023 | Bletchley Park AI Safety Summit | 28 countries sign the Bletchley Declaration; UK AI Safety Institute announced |
| May 21-22, 2024 | Seoul AI Safety Summit | 16 companies sign Frontier AI Safety Commitments; international AI safety network launched |
| February 10-11, 2025 | Paris AI Action Summit | 58 countries sign declaration; US and UK decline; International AI Safety Report published |
| May 2025 | Anthropic activates ASL-3 safeguards | First major activation of a tiered AI safety level system for production models |
| July 2025 | EU GPAI Code of Practice published | Operational compliance framework for general-purpose AI in the EU |
| September 2025 | Google DeepMind publishes FSF v3.0 | Adds manipulation and shutdown resistance as critical capability domains |
| December 2025 | New York RAISE Act signed into law | First US state comprehensive reporting and safety governance regime for frontier AI |
| January 1, 2026 | California SB 53 takes effect | Mandatory safety disclosures for frontier AI model developers |
| February 2026 | International AI Safety Report 2026 published | Most authoritative multilateral assessment of AI capabilities and risks, authored by 100+ experts |
| February 2026 | Anthropic banned from US federal contracts | First AI company banned for refusing to remove safety guardrails for military use |
The EU AI Act is the world's first comprehensive legal framework for regulating artificial intelligence. It entered into force on August 1, 2024, with a phased implementation timeline [26].
Key milestones:
| Date | Requirement |
|---|---|
| February 2, 2025 | Bans on "unacceptable-risk" AI practices and AI literacy requirements take effect |
| August 2, 2025 | Rules for general-purpose AI (GPAI) models, governance structures, penalties, and notified bodies begin applying |
| July 10, 2025 | European Commission publishes the GPAI Code of Practice |
| August 2, 2026 | Full enforcement, including high-risk AI system rules; penalties up to 35 million EUR or 7% of global revenue |
For frontier GPAI models (those exceeding 10^25 floating-point operations in training), the AI Act imposes additional requirements. Providers must establish formal governance structures with independent risk oversight, conduct rigorous testing by qualified independent external evaluators before deployment and after major updates, and perform these evaluations periodically thereafter [26].
However, on March 13, 2026, the Council of the European Union adopted its negotiating position on a proposal to delay the high-risk AI system requirements as part of the "Omnibus VII" simplification package. Under this proposal, standalone high-risk systems would face compliance requirements starting December 2, 2027, while high-risk systems embedded in products would have until August 2, 2028. Trilogue negotiations with the European Parliament are expected to follow [39].
US AI regulation has followed a more fragmented path. President Biden's Executive Order 14110 (October 30, 2023) was the most significant federal action, establishing safety testing requirements, equity protections, and international cooperation mandates. However, President Trump revoked the order on January 20, 2025, replacing it three days later with a new executive order titled "Removing Barriers to American Leadership in Artificial Intelligence," which shifted emphasis from oversight and risk mitigation to deregulation and innovation promotion [13].
In the absence of comprehensive federal legislation, US states have enacted their own AI laws. California passed the Transparency in Frontier Artificial Intelligence Act (California TFAIA), and Texas enacted the Responsible Artificial Intelligence Governance Act (Texas RAIGA), both taking effect on January 1, 2026. Colorado's comprehensive AI law was delayed to June 2026 after amendments in August 2025. New York's RAISE Act was signed on December 19, 2025, establishing the most detailed state-level frontier AI safety governance requirements to date [27] [33].
On December 11, 2025, President Trump signed an additional executive order titled "Ensuring a National Policy Framework for Artificial Intelligence," signaling an intent to consolidate AI oversight at the federal level and counter the expanding patchwork of state AI rules. The order proposes establishing a uniform federal policy framework that would preempt state AI laws deemed inconsistent with federal policy, setting the stage for potential legal challenges [27].
China has been an early and active regulator of AI, pursuing a sector-specific approach rather than a single comprehensive law. The Interim Measures for Administration of Generative AI Services, which took effect on August 15, 2023, made China the first country with binding regulations specifically for generative AI [28].
In 2024, China issued the national standard "Basic security requirements for generative artificial intelligence service," covering corpus safety, model safety, and required security measures. In May 2024, draft Security Requirements for Generative AI detailed technical measures for securing training data and models [28].
In 2025, three national standards for generative AI security took effect on November 1. New content labeling rules, effective September 1, 2025, require AI-generated content to carry visible labels for chatbots, AI-written text, synthetic voices, and face-generated or face-swapped content. Service providers offering generative AI with public opinion or social mobilization capabilities must conduct security assessments and register their large language models with the Cyberspace Administration of China (CAC) [28].
On January 1, 2026, significant amendments to China's Cybersecurity Law took effect, representing the most substantial update since the law's original adoption. For the first time, AI governance was elevated to the level of national law: the amendments explicitly provide that the state will support AI innovation, promote the development of training data resources and computing infrastructure, strengthen AI ethics regulation, and enhance AI risk assessment and security governance. Maximum fines for violations increased to RMB 10 million (approximately $1.4 million) [40].
Beyond the summits described above, international AI governance efforts include the OECD's AI Principles (first adopted in 2019, updated in 2024), the G7 Hiroshima AI Process, and the United Nations' work through the Secretary-General's AI Advisory Body, which published interim recommendations in late 2023. The Council of Europe adopted the Framework Convention on Artificial Intelligence in May 2024, the first legally binding international treaty on AI governance.
AI safety in early 2026 is characterized by several trends.
Frontier model testing is becoming mandatory. The EU AI Act's requirements for testing general-purpose AI models took effect in August 2025, and full enforcement including high-risk system rules arrives in August 2026 (with a potential delay to December 2027 under the proposed Omnibus VII amendments). Multiple US states now have AI safety laws in effect or approaching their effective dates, with California and New York leading on frontier model transparency and governance requirements. The insurance industry has also begun requiring documented evidence of adversarial red-teaming and model-level risk assessments as conditions of coverage [27].
Safety institutes are expanding. The international network of AI safety institutes, launched at the Seoul summit in 2024, has grown to include bodies in the UK, US, Japan, Singapore, and other nations. These institutes collaborate on shared testing methodologies and standards, though their mandates and powers vary. The UK AI Security Institute, with roughly 100 million GBP in funding, remains the best-resourced [6].
Technical capabilities are advancing faster than safety measures. The UK AI Security Institute's 2025 Frontier AI Trends Report found that frontier model capabilities in cybersecurity and scientific domains had improved dramatically. In chemistry and biology, AI models exceeded PhD-level expert performance by up to 60% on some domain-specific benchmarks. Yet the same report noted that model safeguards could be routinely circumvented. The 2026 International AI Safety Report reinforced this finding, warning of a "growing mismatch between the speed of AI capability advances and the pace of governance" [6] [30].
Interpretability is gaining traction. MIT Technology Review named mechanistic interpretability a top-ten breakthrough technology for 2026. Anthropic, OpenAI, and Google DeepMind have all published significant interpretability research. Chain-of-thought monitoring has emerged as a practical tool for understanding reasoning models. Sparse autoencoders have revealed rich internal structure in LLMs. However, current techniques remain computationally expensive and difficult to apply to the largest models [20].
The policy environment is fragmented. The EU has the most comprehensive regulatory framework, but implementation challenges remain, and the proposed Omnibus VII amendments may delay key enforcement dates by over a year. The US lacks federal AI safety legislation, with states filling the gap in an uncoordinated manner and the Trump administration signaling intent to preempt state-level regulation. China continues its sector-specific approach with the newly elevated AI provisions in its Cybersecurity Law amendments. International summits have produced declarations and voluntary commitments, but binding multilateral agreements remain limited to the Council of Europe treaty [27].
Corporate commitments face scrutiny. Major AI labs have published safety frameworks (Anthropic's RSP, OpenAI's Preparedness Framework, DeepMind's Frontier Safety Framework), and 16 companies signed the Seoul Frontier AI Safety Commitments. But critics have questioned the enforceability and sincerity of these pledges, particularly after the departures of senior safety personnel from OpenAI in 2024, reports of Anthropic revising elements of its RSP in 2026, and the Anthropic-Pentagon controversy that revealed the tension between corporate safety commitments and government pressure [22] [23].
Safety and national security are increasingly intertwined. The Anthropic-Pentagon dispute, the UK's renaming of its AI Safety Institute to the AI Security Institute, and the Trump administration's emphasis on AI for military advantage all reflect a shift in how AI safety is framed. The traditional framing of safety as protecting against unintended harms is being supplemented, and in some cases challenged, by a framing centered on national security competitiveness [38].
A January 2026 editorial in Nature called on the global community to use 2026 as the year to "come together for AI safety," noting that despite progress, the gap between the pace of AI capability development and the pace of safety measures continues to widen [29].