AI Safety Levels

Updated Jun 3, 2026

AI Safety Levels (ASL) are a tiered risk-classification scheme developed by Anthropic to grade frontier AI systems by their potential for catastrophic harm and to attach progressively stricter safety, security, and deployment requirements to each tier. The scheme was introduced in September 2023 as the technical core of Anthropic's Responsible Scaling Policy (RSP), the company's voluntary commitment governing how it trains and releases increasingly capable models. Levels run from ASL-1, reserved for systems with no meaningful catastrophic risk, through ASL-2 and ASL-3; ASL-4 and higher were deliberately left without detailed definitions at the outset. ^[1]^[2]

Overview

The framework rests on a principle Anthropic calls proportional protection: the safeguards applied to a model should scale with the danger that model could plausibly enable. Rather than treating "AI risk" as a single undifferentiated quantity, ASL sorts systems into bands defined by capability, and pairs each band with a set of operational standards. As a model's measured abilities climb toward a defined threshold, it crosses into a higher level and the corresponding standards become mandatory before further training or deployment can proceed. ^[1]^[3]

Two kinds of standard sit behind each level. A Deployment Standard governs how a model may be exposed to users, covering measures such as misuse filtering and access restrictions. A Security Standard governs how the model itself is protected, chiefly by securing its weights against theft. Higher levels demand more of both. ^[2]^[3]

Origins and the biosafety analogy

Anthropic modeled the scheme loosely on the United States government's biosafety level (BSL) system, which classifies laboratories and procedures into four tiers (BSL-1 through BSL-4) according to the danger of the pathogens involved and prescribes containment measures accordingly. The naming, the numbered escalation, and the basic logic of matching containment to hazard all carry over. Anthropic was explicit that the analogy is imperfect: unlike the relatively stable BSL regime, the AI version would need frequent iteration and course correction given the speed and uncertainty of the field. The company described the RSP as a living document expected to change as models advance. ^[1]^[2]

The framework was also positioned as a practical answer to a coordination problem. By committing in advance to specific safeguards triggered by specific capabilities, Anthropic argued it could keep developing competitive systems while bounding the worst outcomes, and could offer a template that other labs and regulators might adopt. ^[1]

The levels

The original September 2023 definitions are summarized below.

Level	Definition	Illustrative example or consequence
ASL-1	Systems that pose no meaningful catastrophic risk.	A 2018-era language model, or an AI that only plays chess.
ASL-2	Systems that show early signs of dangerous capabilities but where the information is not yet practically useful, because it is unreliable or no better than a search engine or textbook. Anthropic placed current models, including Claude, at this level.	Present-day large language models.
ASL-3	Systems that substantially increase the risk of catastrophic misuse relative to non-AI baselines (for example meaningful uplift toward chemical, biological, radiological, or nuclear weapons), or that show low-level autonomous capabilities.	Triggers stronger deployment and security standards.
ASL-4 and higher	Not defined in detail at launch, described as too far from present systems. Expected to involve qualitative escalations in misuse potential and autonomy.	Reserved for future, more capable systems.

A point often missed is that ASL-3 in the original text was disjunctive: a model could reach it either through misuse uplift or through low-level autonomy, not only through weapons-related capability. ^[1]^[2]
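
The either/or logic can be made concrete with a small sketch. The Python below is purely illustrative: `Assessment` and `provisional_level` are hypothetical names invented here, and the boolean flags stand in for judgments that in practice come from extensive evaluation rather than a simple switch. It is not Anthropic's procedure, only a way of showing that either ASL-3 condition is sufficient on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """Hypothetical summary of an evaluation; field names are illustrative."""
    early_dangerous_signs: bool  # early signs of dangerous capability, not yet practically useful
    misuse_uplift: bool          # substantial uplift over non-AI baselines (for example CBRN)
    low_level_autonomy: bool     # low-level autonomous capabilities


def provisional_level(a: Assessment) -> int:
    """Return 1, 2, or 3 following the September 2023 definitions as summarized in the table."""
    # ASL-3 is disjunctive: either condition alone is enough.
    if a.misuse_uplift or a.low_level_autonomy:
        return 3
    if a.early_dangerous_signs:
        return 2
    return 1
```

Under this sketch, an assessment with `low_level_autonomy=True` and no misuse uplift still yields 3, without any weapons-related capability being present.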

Required safeguards

ASL-2 standards reflect what Anthropic characterized as current industry best practice, including model cards, basic misuse filtering, and security commitments overlapping with the voluntary White House commitments of 2023. ^[1]

ASL-3 raises the bar on both axes. The Security Standard calls for hardened internal controls that make stealing model weights substantially harder, aimed at resisting theft by non-state and some state-level actors. The Deployment Standard calls for narrowly targeted controls to block the specific misuse pathways the model could enable, such as monitoring for and refusing CBRN-relevant requests. Anthropic's later implementation leaned on automated classifiers (related to its work on Constitutional AI) to screen inputs and outputs, and on additional weight-protection measures such as restricting outbound data flows from secure environments. ^[2]^[3]^[4]

For ASL-4 and beyond, Anthropic did not specify safeguards at launch, committing instead to define them before any model approached that level. Internal research has since explored what evidence might justify deploying such systems, including work on detecting and bounding autonomous misbehavior; see sabotage evaluations. ^[1]^[5]

Capability thresholds and triggering

The mechanism that moves a model between levels is evaluation against capability thresholds. Anthropic runs tests, and commissions outside review, to estimate whether a model has reached abilities that would warrant higher standards. If a model crosses a threshold, the matching ASL standard must be in place before it can be trained further or deployed. ^[2]^[3]
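
The gating rule described here can be sketched in a few lines. This is a minimal illustration under two stated assumptions: that levels can be compared as integers, and that both the Deployment Standard and the Security Standard must meet the required level before a model proceeds. The names `Safeguards` and `may_proceed` are hypothetical and do not come from the policy text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Safeguards:
    """Highest ASL standard currently implemented on each axis (hypothetical model)."""
    deployment_level: int
    security_level: int


def may_proceed(required_level: int, in_place: Safeguards) -> bool:
    """True only if both standards meet or exceed the level the evaluations call for."""
    # Assumption: both axes gate progress; the real policy may treat them separately.
    return (in_place.deployment_level >= required_level
            and in_place.security_level >= required_level)
```

For example, `may_proceed(3, Safeguards(deployment_level=3, security_level=2))` is false in this sketch, because the security axis lags the required level.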

The October 2024 revision of the RSP reframed the scheme around this idea more explicitly, pairing each Capability Threshold with a set of Required Safeguards rather than presenting the levels as a fixed staircase. Two thresholds were singled out. A model able to meaningfully assist someone with basic technical training in creating or deploying CBRN weapons would require ASL-3 standards. A model able to independently conduct complex AI research normally requiring human experts would require ASL-4 or higher. ^[3]
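
The threshold-and-safeguard pairing amounts to a lookup table. The sketch below encodes only the two thresholds named in the paragraph above; the dictionary keys and the function `safeguards_for` are hypothetical labels chosen for illustration, not identifiers from the policy.

```python
from typing import Iterable, List

# Hypothetical keys; the values paraphrase the two pairings described in the text.
REQUIRED_SAFEGUARDS = {
    "cbrn_assistance": "ASL-3",
    "independent_ai_research": "ASL-4 or higher",
}


def safeguards_for(crossed: Iterable[str]) -> List[str]:
    """Look up the safeguards paired with each crossed threshold, in table order."""
    crossed_set = set(crossed)
    unknown = crossed_set - set(REQUIRED_SAFEGUARDS)
    if unknown:
        # Refuse to guess at thresholds this table does not define.
        raise KeyError(f"unknown thresholds: {sorted(unknown)}")
    return [level for name, level in REQUIRED_SAFEGUARDS.items() if name in crossed_set]
```

Keeping the pairing as data rather than as a fixed ordering of levels mirrors the framing in the revision: each threshold carries its own required safeguards, and a model that crosses none of them returns an empty list.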

Real-world application: Claude Opus 4 and ASL-3

In May 2025 Anthropic activated the ASL-3 Deployment and Security Standards for Claude Opus 4, making it the first model the company released under those protections. The decision was framed as precautionary and provisional rather than as a confirmed threshold crossing. Anthropic stated it had not determined that Opus 4 actually required ASL-3, only that, given continued gains in CBRN-relevant knowledge, it could no longer rule out the risk as it had for earlier models. The same disclosure said the company had ruled out the need for ASL-4 protections for Opus 4, and had ruled out ASL-3 for the lighter Claude Sonnet 4. The deployment controls were described as narrow enough that ordinary users would rarely notice refusals, affecting only a small band of CBRN-adjacent queries. The episode was reported by outlets including CNBC as the first time a major lab had shipped a model under its own highest active safeguards. ^[4]^[6]^[7]

Evolution and criticism

The scheme has changed substantially since 2023. The October 2024 update introduced the threshold-and-safeguard framing. Version 3.0, effective February 24, 2026, went further: it disaggregated the AI research and development thresholds into separate levels, added a CBRN threshold pegged to uplift for moderately resourced state programs, and introduced transparency mechanisms such as periodic Risk Reports and a Frontier Safety Roadmap. It also separated Anthropic's binding unilateral commitments from looser industry-wide recommendations, leaving ASL-3 as the main remaining unilateral commitment. ^[8]^[9]

The most contested change in v3.0 was the removal of the earlier, near-categorical pledge to pause development if adequate safeguards were not in place. The revised text instead makes any delay depend on two conditions that must both hold at the same time, which several commentators read as a weakening of the original promise. ^[9]^[10]

Critics have pressed the framework on other points as well. Analyses on LessWrong, including work by Ryan Greenblatt, argued that because Anthropic can revise the policy, a commitment to "pause" until safeguards are ready can in practice mean pausing only until the standards themselves are lowered, and that the ASL-3 protections were defined more vaguely than the concrete ASL-2 measures. The persistent absence of detailed safeguards for ASL-4 and above drew repeated attention, since those are the levels most associated with existential concern. METR, which reviewed drafts of the policy, cautioned that methods for assessing and mitigating risk are not keeping pace with capability gains. Supporters counter that transparent, honestly revised commitments are preferable to firm-sounding pledges that would be quietly abandoned, and that the Risk Reports and Roadmap add real accountability. ^[9]^[10]^[11]

References

[1] Anthropic's Responsible Scaling Policy, Anthropic, September 19, 2023.
[2] Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust, LessWrong, 2023.
[3] Announcing our updated Responsible Scaling Policy, Anthropic, October 15, 2024.
[4] Activating AI Safety Level 3 protections, Anthropic, May 22, 2025.
[5] Three Sketches of ASL-4 Safety Case Components, Anthropic Alignment Science, 2024.
[6] System Card: Claude Opus 4 & Claude Sonnet 4, Anthropic, May 2025.
[7] Anthropic adds Claude 4 security measures to limit risk of users developing weapons, CNBC, May 23, 2025.
[8] Responsible Scaling Policy Version 3.0, Anthropic, February 24, 2026.
[9] Anthropic's RSP v3.0: How it Works, What's Changed, and Some Reflections, Centre for the Governance of AI, 2026.
[10] Anthropic Drops Hard Safety Limits From its AI Scaling Policy, WinBuzzer, February 25, 2026.
[11] Responsible Scaling Policy v3, LessWrong, 2026.
