# Responsible Scaling Policy

> Source: https://aiwiki.ai/wiki/responsible_scaling_policy
> Updated: 2026-06-21
> Categories: AI Policy & Regulation, AI Safety
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **Responsible Scaling Policy** (**RSP**) is a self-imposed governance framework in which a frontier [AI](/wiki/ai_safety) developer commits in advance to safety practices, capability evaluations, deployment restrictions, and security measures that scale with the capability of its models, triggered when those models cross defined capability thresholds. [Anthropic](/wiki/anthropic), which coined the term and published the first RSP on September 19, 2023, describes it as "a series of technical and organizational protocols that we're adopting to help us manage the risks of developing increasingly capable AI systems." [1] The defining feature is pre-commitment: a developer publicly fixes safeguards before its models reach dangerous capability levels, so that decisions about training or deployment are constrained by rules written before the commercial pressure to ship arrives. [37] Capability thresholds in an RSP are often called "AI Safety Levels" (ASLs), "Critical Capability Levels" (CCLs), or numerically graded risk tiers.

The label remains specific to Anthropic, but other frontier developers have since published functionally similar documents: [OpenAI](/wiki/openai)'s Preparedness Framework (December 18, 2023, with v2 in April 2025), [Google DeepMind](/wiki/google_deepmind)'s Frontier Safety Framework (May 17, 2024, with later versions in February 2025, September 2025, and April 2026), and [Meta](/wiki/meta_ai)'s Frontier AI Framework (February 2025). [15][21][26] RSP-style policies are now a standard component of [AI governance](/wiki/ai_governance) and are referenced in the [EU AI Act](/wiki/eu_ai_act)'s General-Purpose AI Code of Practice (August 2025), the [Bletchley Declaration](/wiki/bletchley_declaration) (November 2023), and the Frontier AI Safety Commitments signed at the [Seoul AI Safety Summit](/wiki/seoul_declaration) (May 2024). [33][34][54] By December 2025, [METR](/wiki/metr) counted twelve frontier developers with published frameworks: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA. [38]

As of June 2026, the current version of Anthropic's RSP is **version 3.1**, effective April 2, 2026. [7] The most consequential operational event under the policy to date was the activation of **ASL-3** safeguards for [Claude](/wiki/claude) Opus 4 on May 22, 2025, the first real-world deployment of an AI Safety Level above ASL-2. [8]

## What are the core ingredients of an RSP?

A Responsible Scaling Policy has four standard ingredients:

1. A **risk taxonomy** naming the catastrophic harms the developer cares about. Common categories are misuse for chemical, biological, radiological, or nuclear (CBRN) weapons; offensive cyber operations; large-scale persuasion or manipulation; autonomous AI research and development (AI R&D); and autonomous replication and adaptation (ARA), where a model acquires resources, copies itself, and operates without oversight. [43]
2. **Capability thresholds**. For each risk category the policy defines a level of capability that, once crossed, triggers a stricter safety regime. Anthropic uses ASL-1 through ASL-4+; OpenAI uses Low, Medium, High, and Critical per category (later collapsed to High and Critical); Google DeepMind uses CCLs. [38]
3. A **measurement procedure**. The developer commits to running dangerous-capability evaluations on a defined cadence (six months in Anthropic's RSP v2, or whenever effective compute increases by a defined multiplier) and before any major deployment. [3]
4. **Required safeguards** that map to each capability level: deployment restrictions, security controls on weights, misuse monitoring, red-teaming, and in some cases an explicit pause on further training. [38]

The policy's bite comes from the conditional structure: a developer commits in advance that, if a future evaluation shows a model crossing a threshold and the matching safeguards are not in place, it will halt deployment or training rather than ship the model. This removes the live commercial decision from the moment of measurement and replaces it with a rule written months earlier. [37] As discussed below, the strength of that pre-commitment has weakened in some recent revisions.
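The conditional structure can be sketched as a simple gate. The Python below is a minimal illustration, not any developer's actual tooling: the `Threshold` record, the `gate` function, and the safeguard names are hypothetical, chosen only to show that the outcome follows from which required safeguards are missing rather than from a judgment made at evaluation time.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    HALT = "halt"


@dataclass(frozen=True)
class Threshold:
    # Hypothetical record: a named capability threshold and the
    # safeguards the policy requires once that threshold is crossed.
    name: str
    required_safeguards: frozenset


def gate(crossed, safeguards_in_place):
    """Return (decision, missing) for the thresholds an evaluation found crossed.

    The rule is fixed in advance: if any crossed threshold lacks one of its
    required safeguards, the decision is HALT.
    """
    missing = {}
    for threshold in crossed:
        gap = threshold.required_safeguards - safeguards_in_place
        if gap:
            missing[threshold.name] = sorted(gap)
    decision = Decision.HALT if missing else Decision.PROCEED
    return decision, missing


# Illustrative safeguard names, not taken from any published policy.
cbrn = Threshold("CBRN", frozenset({"deployment_standard", "security_standard"}))
decision, missing = gate([cbrn], frozenset({"deployment_standard"}))
print(decision.value, missing)  # halt {'CBRN': ['security_standard']}
```

Returning the missing safeguards alongside the decision mirrors the way these policies tie a halt to a specific gap that can later be closed.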

## Anthropic's Responsible Scaling Policy

### Version 1.0, September 19, 2023

Anthropic published RSP v1.0 on September 19, 2023. [1] The document defined four AI Safety Levels which Anthropic described as "modeled loosely after the US government's biosafety level (BSL) standards for handling of dangerous biological materials." [1] Higher ASLs require increasingly strict demonstrations of safety before models can be developed or deployed. [2]

The original ASL definitions:

| Level | Description | Examples |
|-------|-------------|----------|
| ASL-1 | Systems that pose no meaningful catastrophic risk. | A 2018 LLM, or an AI system that only plays chess. |
| ASL-2 | Systems that show early signs of dangerous capabilities (such as describing how to build bioweapons) but where the information is not yet useful in practice due to insufficient reliability or because it is not meaningfully better than what a determined searcher could find. | Frontier LLMs at the time of publication, including [Claude](/wiki/claude). |
| ASL-3 | Systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (such as a search engine or a textbook), or that show low-level autonomous capabilities. | Not yet reached at the time of v1.0. |
| ASL-4 and higher | Reserved for future revisions; involves qualitative escalations in catastrophic misuse potential and autonomy. | Speculative. |

For each ASL, Anthropic specified a "Required Safeguards" set: a deployment standard governing external use, and a security standard for weights and training infrastructure. [2] ASL-3 added enhanced internal access controls on weights, hardened deployment with abuse monitoring, and red-teaming designed to demonstrate the model could not provide meaningful uplift to a CBRN attacker. [2]

v1.0 also formalised governance practices: board approval for any policy change, a designated Responsible Scaling Officer, public reporting of evaluation results, and a commitment to pause scaling if the safeguards for a model's capability level were not yet ready. [2]

### Version 2.0, October 15, 2024

Anthropic published RSP v2.0 on October 15, 2024. [3] The update kept the ASL structure but reorganised the policy around named **Capability Thresholds** that, when crossed, place a model under the corresponding ASL safeguards. [4] The two threshold families defined explicitly in v2.0 are:

- **CBRN-3**: a model that meaningfully assists individuals or groups with basic technical backgrounds (for example, an undergraduate STEM degree) in creating or deploying chemical, biological, radiological, or nuclear weapons. Crossing this threshold requires the ASL-3 Deployment and Security Standards. [4]
- **AI R&D-4 / AI R&D-5**: capabilities that allow a model to fully automate an entry-level researcher's work at Anthropic, or cause dramatic acceleration in effective scaling. The higher AI R&D threshold requires ASL-4 safeguards. [4]

v2.0 clarified that ASL labels refer to **groups of safeguards**, not to models; a model can be operating under ASL-2 standards while approaching the CBRN-3 threshold. [4] The evaluation cadence was lengthened from roughly three months to six months, on the rationale that the previous interval did not leave enough time for high-quality capability elicitation. [3] Anthropic disclosed that its first round of internal compliance had fallen short of v1.0's procedural commitments; this motivated the rewrite. [3]

Jared Kaplan, Anthropic's chief science officer, took on the role of Responsible Scaling Officer at the v2.0 transition, with a separate Head of Responsible Scaling position created for day-to-day work. [3]

### Compliance review and the missed commitments of v1.0

Alongside the v2.0 release, Anthropic conducted an internal review of its compliance with v1.0 and disclosed two specific shortfalls: [3]

1. **Evaluation timing.** The most recent round of evaluations was completed three days after the three-month window required by v1.0. Anthropic argued that the delay allowed for higher-quality capability elicitation and used the experience to extend the interval to six months in v2.0. [3]
2. **Autonomy evaluation drift.** The autonomy evaluation set was updated from the placeholder tasks named in v1.0 without a parallel policy update, even though one reading of the v1.0 text required such an update. [3]

Anthropic concluded the lapses posed minimal risk in the specific cases, but cited them as evidence that a tight three-month cadence and rigid evaluation suite had become brittle. [3] The episode is one of the few public accounts of an RSP signatory grading itself against its own document. Critics including SaferAI argued the response weakened the document overall. [14]
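The timing commitment amounts to a deadline check with two triggers: elapsed time and growth in effective compute. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, three and six months are approximated as 90 and 180 days, the dates are invented, and the compute multiplier of 4.0 used in the usage note is a placeholder rather than a figure taken from the policy text quoted here.

```python
from datetime import date, timedelta


def evaluation_due(last_eval, today, interval_days, compute_ratio, compute_multiplier):
    """Return True if the time trigger or the compute trigger has fired."""
    time_trigger = (today - last_eval).days >= interval_days
    compute_trigger = compute_ratio >= compute_multiplier
    return time_trigger or compute_trigger


def days_overdue(last_eval, completed, interval_days):
    """Days by which a completed evaluation missed its window (0 if on time)."""
    deadline = last_eval + timedelta(days=interval_days)
    return max(0, (completed - deadline).days)


# Hypothetical dates: an evaluation finished 93 days after the previous one.
last_eval = date(2024, 1, 1)
completed = last_eval + timedelta(days=93)

late_under_3_months = days_overdue(last_eval, completed, interval_days=90)
late_under_6_months = days_overdue(last_eval, completed, interval_days=180)
print(late_under_3_months, late_under_6_months)  # 3 0
```

The same completion date is three days late against the approximated three-month window and on time against the six-month one. A call such as `evaluation_due(last_eval, today, 180, 4.2, 4.0)` would return `True` regardless of the date, because the assumed compute trigger fires on its own.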

### Versions 2.1 and 2.2 (March and May 2025)

Minor updates were issued as v2.1 (March 31, 2025) and v2.2 (May 14, 2025). [5] v2.1 clarified the procedure for evaluating ASL-3 deployment safeguards, including how the criterion of "sufficiently minimising the residual risk" should be interpreted. v2.2, effective eight days before the [Claude](/wiki/claude) Opus 4 launch, fine-tuned the threat models for CBRN uplift and clarified that holding back capability via post-training restrictions could substitute for raw threshold avoidance only if the restriction itself was robust to jailbreaks. [5]

### When was ASL-3 first activated? (May 22, 2025)

The most consequential operational event came on May 22, 2025: alongside the launch of [Claude](/wiki/claude) Opus 4 and Claude Sonnet 4, Anthropic announced that it was activating the **ASL-3 Deployment and Security Standards** for Claude Opus 4. [8] This was the first real-world activation of an ASL above ASL-2. [8] Notably, Anthropic did not claim the model had definitively crossed the CBRN-3 threshold; it activated the standard as a precaution. As the company stated, "We have not yet determined whether Claude Opus 4 has definitively passed the Capabilities Threshold that requires ASL-3 protections," adding that proactively enabling a higher standard "simplifies model releases while allowing us to learn from experience by iteratively improving our defenses." [8]

The ASL-3 measures included "constitutional classifiers" trained to detect and refuse CBRN-related queries, more than 100 new internal security controls including egress bandwidth restrictions to slow any exfiltration of model weights, and two-party authorisation for sensitive operations on production assets. [8] Subsequent Opus releases ([Claude Opus 4.5](/wiki/claude_opus_4_5) in November 2025, [Claude Opus 4.6](/wiki/claude_opus_4_6) in early 2026, and [Claude Opus 4.7](/wiki/claude_opus_4_7)) shipped under the same ASL-3 standard. [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5) became the first Sonnet-tier model to require ASL-3, while [Claude Haiku 4.5](/wiki/claude_haiku_4_5) launched at ASL-2 after Anthropic's evaluations placed it below the rule-out threshold for biology and autonomy.

### Version 3.0, February 24, 2026

Version 3.0 of the RSP became effective February 24, 2026. [6] v3.0 was a structural overhaul rather than an incremental edit. It introduced a public **Frontier Safety Roadmap** describing concrete plans for risk mitigations across Security, Alignment, Safeguards, and Policy, and a system of **Risk Reports** explaining how a given model's capabilities, threat models, and active mitigations fit together. [6] v3.0 disaggregated the AI R&D threshold into two distinct levels, added a new CBRN development capability threshold, and replaced the autonomous replication and adaptation threshold with an autonomous-capabilities "checkpoint" that triggers additional evaluation rather than automatic safeguard escalation. [6]

The most controversial change was the removal of the categorical pause commitment. Earlier versions barred Anthropic from training or deploying a model that crossed a Capability Threshold without matching safeguards in place. v3.0 replaced that hard stop with a softer pledge to "delay" development in cases where leadership judges Anthropic to be the AI race leader and the catastrophic-risk concerns to be significant. [6] Time magazine's exclusive on the change, headlined "Anthropic Drops Flagship Safety Pledge", framed it as a step away from the original RSP philosophy. [11] Jared Kaplan defended the change publicly: "We felt that it wouldn't actually help anyone for us to stop training AI models" if competitors continued. [11] Critics including Zvi Mowshowitz characterised it as the loss of the RSP's core load-bearing commitment. [13]

### Version 3.1, April 2, 2026

Version 3.1 followed on April 2, 2026 with two clarifications added in response to reader feedback on v3.0: [7]

- A clearer definition of the AI R&D capability threshold. The v3.0 phrasing referring to "doubling the rate of progress" had been read by some as meaning either a doubling of aggregate capability progress or merely a doubling of researcher productivity. v3.1 specified the former. [7]
- Confirmation that, even where the RSP itself does not require it, Anthropic remains free to pause training or deployment when it judges a pause appropriate. [7]

The second clarification was widely read as a response to the framing of the v3.0 changes as a relaxation, although it did not restore the categorical pause commitment. [7] As of June 2026, version 3.1 is the current effective version of the policy. [7]

### Anthropic RSP version history

| Version | Effective | Key changes |
|---------|-----------|-------------|
| 1.0 | September 19, 2023 | First RSP. Defines ASL-1 to ASL-4. Three-month evaluation cadence. |
| 2.0 | October 15, 2024 | Capability Thresholds (CBRN-3, AI R&D-4, AI R&D-5). Six-month cadence. ASL applied to safeguards, not models. Compliance review of v1.0 published. |
| 2.1 | March 31, 2025 | Clarified ASL-3 deployment criterion and "sufficiently minimising" residual risk. |
| 2.2 | May 14, 2025 | Tightened CBRN uplift threat models; clarified jailbreak-robust post-training restrictions. |
| 3.0 | February 24, 2026 | Frontier Safety Roadmap; Risk Reports; CBRN development threshold; AI R&D split into two levels; ARA replaced with autonomy checkpoint; categorical pause replaced with conditional delay pledge. |
| 3.1 | April 2, 2026 | Sharpened AI R&D threshold definition; reaffirmed Anthropic may pause development independently of RSP requirements. Current version. |

### Claude model ASL classifications

| Model | Released | ASL standard | Notes |
|-------|----------|---------------|-------|
| [Claude 1](/wiki/claude) | March 2023 | ASL-2 | Pre-RSP; retroactively assigned. |
| Claude 2 | July 2023 | ASL-2 | Evaluated by ARC Evals against early ASL-3 indicators. |
| Claude 3 family | March 2024 | ASL-2 | ARA evaluations conducted by [METR](/wiki/metr). |
| Claude 3.5 Sonnet (upgraded) | October 2024 | ASL-2 | First joint US AISI / UK AISI evaluation. |
| [Claude 3.7 Sonnet](/wiki/claude) | February 2025 | ASL-2 | Extended thinking debuted under ASL-2. |
| Claude Sonnet 4 | May 22, 2025 | ASL-2 | Released alongside Opus 4. |
| [Claude Opus 4](/wiki/claude) | May 22, 2025 | ASL-3 | First-ever ASL-3 activation. |
| Claude Opus 4.1 | August 5, 2025 | ASL-3 | Coding and tool-use refinements. |
| [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5) | September 2025 | ASL-3 | First Sonnet at ASL-3. |
| [Claude Haiku 4.5](/wiki/claude_haiku_4_5) | October 2025 | ASL-2 | Below rule-out threshold on biology and autonomy. |
| [Claude Opus 4.5](/wiki/claude_opus_4_5) | November 24, 2025 | ASL-3 | Same standard as Opus 4. |
| Claude Sonnet 4.6 | Late 2025 / early 2026 | ASL-3 | Performed at or below Opus 4.6 on automated evals. |
| [Claude Opus 4.6](/wiki/claude_opus_4_6) | 2026 | ASL-3 | First Opus released under RSP v3 framing. |
| [Claude Opus 4.7](/wiki/claude_opus_4_7) | 2026 | ASL-3 | Same classification; no model has yet been declared ASL-4. |

## OpenAI's Preparedness Framework

OpenAI published the beta of its Preparedness Framework, its first RSP-style policy, on December 18, 2023. [15] It was authored by a newly created Preparedness team initially led by Aleksander Madry; the team was reorganised in 2024 after several departures. [15]

The original framework tracked four risk categories, each scored Low, Medium, High, or Critical: [15]

| Category | Examples of risk |
|----------|------------------|
| Cybersecurity | Uplift to offensive cyber operations, automated discovery and exploitation of vulnerabilities. |
| CBRN | Chemical, biological, radiological, and nuclear weapons assistance. |
| Persuasion | Highly tailored mass persuasion or manipulation. |
| Model autonomy | Self-exfiltration, autonomous replication, autonomous economic activity. |

Two rules follow from the scores: a model at High in any category can be deployed only once mitigations bring its post-mitigation score below High, and a model at Critical in any category "cannot be developed" until the score is reduced. [15]
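These rules can be written as two checks over a per-category scorecard. The sketch below is illustrative only: the `Risk` enum, the function names, and the example scores are hypothetical, and it assumes that the development rule, like the deployment rule, is applied to post-mitigation scores.

```python
from enum import IntEnum


class Risk(IntEnum):
    # Ordered so that comparisons such as `score < Risk.HIGH` are meaningful.
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def can_deploy(post_mitigation):
    """Deployable only if every category scores below High after mitigation."""
    return all(score < Risk.HIGH for score in post_mitigation.values())


def can_develop(post_mitigation):
    """Further development allowed only if no category is at Critical."""
    return all(score < Risk.CRITICAL for score in post_mitigation.values())


# Invented scores for the four categories in the original framework.
scorecard = {
    "cybersecurity": Risk.MEDIUM,
    "cbrn": Risk.HIGH,
    "persuasion": Risk.LOW,
    "model_autonomy": Risk.LOW,
}
print(can_deploy(scorecard), can_develop(scorecard))  # False True

# After a (hypothetical) mitigation lowers the CBRN score to Medium.
mitigated = {**scorecard, "cbrn": Risk.MEDIUM}
print(can_deploy(mitigated))  # True
```

Because the checks use `all`, a single category at or above the relevant level is enough to block deployment or development, which matches the "in any category" wording.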

### Version 2 (April 15, 2025)

The Preparedness Framework was updated to **version 2** on April 15, 2025. [16] v2 narrowed the Tracked Categories to Biological and Chemical, Cybersecurity, and AI Self-Improvement, and introduced separate Research Categories (Long-range Autonomy, Sandbagging, Autonomous Replication and Adaptation, Undermining Safeguards, and Nuclear and Radiological). [17] Persuasion was moved out of the Preparedness Framework, a change some observers read as softening the original commitments. [16] The four-tier Low/Medium/High/Critical scoring was retained. [17]

### Streamlining to High and Critical

A later refresh of the Preparedness Framework collapsed the deployment-relevant scoring to two thresholds. The Low and Medium tiers were dropped from the headline classification because they did not, on their own, trigger any specific safeguards. A model that crosses the High threshold can only be deployed after safeguards "sufficiently minimise" the associated risk; a Critical classification additionally restricts development. Each release covered by the framework now ships with a **Safeguards Report** documenting how each safeguard was designed and verified. The change also introduced a sharper expectation about what "sufficiently minimise" means in practice and aligned the framework's vocabulary more closely with the EU GPAI Code of Practice. [55]

### High classification of GPT-5 and successors

OpenAI's pre-deployment evaluations under the framework have been published in system cards for GPT-4 Turbo, GPT-4o, [o1](/wiki/o1), o3, o4-mini, GPT-4.5, and the GPT-5 family. [18] The most consequential public classification came with [GPT-5](/wiki/gpt-5): OpenAI declared GPT-5 (specifically GPT-5-thinking) **High capability in the Biological and Chemical** domain and activated the associated safeguards. [18] Successive system cards ([GPT-5.1](/wiki/gpt-5.1), GPT-5.2, GPT-5.3-Codex, GPT-5.4, GPT-5.5) maintained the High classification in Biological and Chemical and added cybersecurity risk evaluation for GPT-5.5. [19][20] GPT-5.5 was treated as High in both Biological and Chemical and in Cybersecurity, with the safeguards mix expanded accordingly. [19] No OpenAI model has been classified as Critical to date. [18]

### OpenAI Preparedness Framework version history

| Version | Date | Headline change |
|---------|------|-----------------|
| Beta (v1) | December 18, 2023 | Four Tracked Categories: Cyber, CBRN, Persuasion, Model autonomy. Low / Medium / High / Critical scale. |
| Version 2 | April 15, 2025 | Tracked Categories narrowed to Bio/Chem, Cyber, AI Self-Improvement. Persuasion moved out. New Research Categories. |
| Streamlined revision | 2026 | Levels collapsed to High and Critical. Safeguards Reports introduced. "Sufficiently minimise" criterion sharpened. |

## Google DeepMind's Frontier Safety Framework

[Google DeepMind](/wiki/google_deepmind) published the first version of its **Frontier Safety Framework** (FSF) on May 17, 2024, with v2.0 on February 4, 2025, v3.0 on September 22, 2025, and a further iteration on April 17, 2026 introducing Tracked Capability Levels. [21][22][23] The FSF is structured around **Critical Capability Levels** (CCLs), capability thresholds at which a model could pose heightened risk of severe harm absent mitigations. [21] CCLs are derived domain-by-domain: for each domain, DeepMind defines the minimal capability profile that would unlock severe harm in that domain. [24]

The FSF identifies CCLs across:

- **Cyber**: offensive cyber operations against well-defended targets. [24]
- **CBRN / biosecurity**: meaningful uplift to actors developing biological, chemical, radiological, or nuclear weapons. [24]
- **Machine learning research and development**: autonomously conducting ML R&D in ways that could shorten timelines or remove humans from the loop. [24]
- **Autonomy**: long-horizon agentic capability and resilience to human oversight. [24]
- **Manipulation**: added in v3.0, harmful manipulation as a separate domain. The CCL targets models with manipulative capabilities sufficient to systematically and substantially change beliefs and behaviours in identified high-stakes contexts at severe scale. [23]
- **Shutdown resistance**: also added in v3.0, capabilities relevant to a model resisting human shutdown or correction. [23]

The framework operates in three stages: identify capabilities that could cause severe harm, run periodic "early warning" evaluations as frontier models approach a CCL, and apply escalating deployment and security mitigations once a CCL is reached. [21] Each major model release (Gemini 2.5 Pro, the [Gemini 3](/wiki/gemini_3) series) is accompanied by an FSF report. [25]

### Tracked Capability Levels (April 2026)

In April 2026, DeepMind added **Tracked Capability Levels** (TCLs) in selected domains. A TCL is set below the corresponding CCL and is meant to flag less extreme but still concerning capabilities earlier in the development cycle, allowing safeguard work to begin before the harder threshold is reached. TCLs align the FSF more closely with the staged review structure now common in other frameworks.
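The relationship between a TCL and its CCL can be pictured as two ordered cut-offs on a measured capability. The sketch below is a simplification under loud assumptions: the FSF as described here does not reduce capabilities to a single numeric score, so the scores, threshold values, domain keys, and the `fsf_stage` function are all hypothetical.

```python
def fsf_stage(score, tcl, ccl):
    """Classify a capability score against an early-warning TCL and a CCL."""
    if not tcl < ccl:
        # By construction, a TCL is set below the corresponding CCL.
        raise ValueError("a TCL must sit below its CCL")
    if score >= ccl:
        return "ccl_reached"
    if score >= tcl:
        return "tcl_reached"
    return "below_tcl"


# Invented (tcl, ccl) pairs and evaluation scores on a 0-to-1 scale.
thresholds = {"cyber": (0.4, 0.8), "ml_rnd": (0.5, 0.9)}
scores = {"cyber": 0.55, "ml_rnd": 0.2}

stages = {domain: fsf_stage(scores[domain], *thresholds[domain]) for domain in thresholds}
print(stages)  # {'cyber': 'tcl_reached', 'ml_rnd': 'below_tcl'}
```

In this toy example the cyber domain has crossed its early-warning level without reaching the CCL, which is the situation in which safeguard work would be expected to begin.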

### FSF results for Gemini 3 Pro

The Gemini 3 Pro FSF report (November 2025) found a statistically significant difference from non-AI baselines on most manipulative-efficacy metrics, but no statistically significant difference between Gemini 2.5 Pro and Gemini 3 Pro on the same metrics. [25] No model has yet been declared to have reached a CCL. [25]

### DeepMind FSF version history

| Version | Date | Headline change |
|---------|------|-----------------|
| 1.0 | May 17, 2024 | First FSF. CCLs across Cyber, CBRN, ML R&D, Autonomy. |
| 2.0 | February 4, 2025 | Refined CCL definitions; tightened evaluation procedures. |
| 3.0 | September 22, 2025 | Manipulation and shutdown resistance added as domains. |
| 3.x update | April 17, 2026 | Tracked Capability Levels introduced for early-warning detection. |

## Meta's Frontier AI Framework

[Meta](/wiki/meta_ai) published its **Frontier AI Framework** in February 2025. [26] The document defines two risk thresholds, **High Risk** and **Critical Risk**, in two domains: cybersecurity and CBRN. [26] The framework prescribes different responses for each tier:

- A **High Risk** system has access restricted internally and its release delayed until mitigations bring the residual risk to a moderate level. [26]
- A **Critical Risk** system triggers security measures to prevent weight exfiltration, and development halts until the system is no longer critical. [26]

Meta's framework is notable for committing to halt the development of critical-risk systems even though Meta's [Llama](/wiki/meta_ai) family is largely released as open weights. [26] The CBRN and cyber thresholds were drafted in light of red-team findings on prior Llama generations. Critics have pointed out that, because recall after an open-weight release is effectively impossible, a critical-risk classification carries a higher operational cost for Meta than for closed-weight labs.

## xAI's Risk Management Framework

[xAI](/wiki/xai) released the first draft of its **Risk Management Framework** in February 2025, with a finalised version on August 20, 2025 alongside the Grok 4 model card, and a Frontier AI Framework iteration dated December 31, 2025. [27][28][29] The xAI RMF distinguishes two umbrella categories: **malicious use** and **loss of control**. [27] Within these, evaluations target three buckets of model behaviour: abuse potential (including jailbreak vulnerability), concerning propensities (including a propensity to deceive the user), and dual-use capabilities (including offensive cyber). [27] The framework borrows the threat-modelling structure used by other labs for CBRN, decomposing weapons development into ideation, design, build, and test phases. [27]

The xAI framework has been criticised for the absence of any explicit pre-commitment to halt training, and for shipping Grok 4 in July 2025 without a public safety report alongside the launch. [13] xAI subsequently published the Grok 4 model card in August 2025, and a Grok 4.1 model card on November 17, 2025. [29][30] Independent reviewers including Zvi Mowshowitz and the LessWrong community have rated the framework as comparatively weak relative to peer policies, citing under-specified thresholds, sparse evaluation detail, and the gap between framework publication and model release. [13]

## Microsoft and other signatories

Microsoft has reaffirmed the July 2023 White House Voluntary Commitments and the May 2024 Seoul Frontier AI Safety Commitments, but as of late 2025 had not published a single threshold-based framework analogous to the Anthropic, OpenAI, or DeepMind documents. [32][34] METR's December 2025 review classified Microsoft as a framework signatory under a broader Responsible AI Standard rather than a dedicated frontier safety framework. [38] Microsoft's first-party model releases (including Phi and earlier MAI models) have published documentation in line with the Responsible AI Standard.

METR's December 2025 census of frontier AI safety policies counted twelve developers with published frameworks: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA. [38] Inflection and Mistral signed the Seoul commitments but have published lighter-weight policies. [34] Chinese frontier labs, including DeepSeek, Zhipu, and Alibaba, had not published an RSP-equivalent document at the time of the Seoul summit; some have since released safety reports or policies, but none currently match the structure of the Anthropic, OpenAI, or DeepMind frameworks.

## How do the major frameworks compare?

| Framework | Lab | First published | Capability levels | Risk categories | Deployment trigger |
|-----------|-----|-----------------|-------------------|-----------------|--------------------|
| Responsible Scaling Policy | [Anthropic](/wiki/anthropic) | Sept 19, 2023 (v1.0) | ASL-1 to ASL-4+ | CBRN, AI R&D, autonomy | Crossing a Capability Threshold without matching ASL safeguards triggers delay or pause; v3.0 replaced the categorical halt with a conditional delay pledge. |
| Preparedness Framework | [OpenAI](/wiki/openai) | Dec 18, 2023 (beta); v2 Apr 15, 2025; streamlined 2026 | High / Critical (post-streamlining) | Bio/chem, cyber, AI self-improvement (Tracked); plus Research Categories | High score blocks deployment until mitigated; Critical also restricts development. |
| Frontier Safety Framework | [Google DeepMind](/wiki/google_deepmind) | May 17, 2024 (v1); v2 Feb 4, 2025; v3 Sep 22, 2025; TCL update Apr 17, 2026 | Critical Capability Levels per domain; Tracked Capability Levels added 2026 | Cyber, CBRN, ML R&D, autonomy, manipulation, shutdown resistance | Reaching a CCL triggers escalating deployment and security mitigations defined per domain. |
| Frontier AI Framework | [Meta](/wiki/meta_ai) | Feb 2025 | High / Critical risk | Cybersecurity, CBRN | Critical risk halts further development of the system; High risk delays release until mitigated. |
| Risk Management Framework | [xAI](/wiki/xai) | Feb 2025 (draft); Aug 20, 2025 (v1); Dec 31, 2025 update | Quantitative thresholds per metric | Malicious use; loss of control | Mitigations scale with model capability; explicit pause commitments are limited. |
| Responsible AI Standard | Microsoft | 2022 (Standard v2); ongoing updates | Not capability-graded | Cross-cutting | Pre-deployment review by Office of Responsible AI; framework-level commitments via Seoul. |
| White House Voluntary Commitments | 7 (later 15) labs | Jul 21, 2023 | Not capability-graded | Biosecurity, cybersecurity, societal harm | Internal and external red-teaming before release; weight protection; watermarking research. |
| Frontier AI Safety Commitments | 16 labs (Seoul) | May 21, 2024 | Self-defined per signatory | Self-defined per signatory | Each signatory publishes thresholds it deems intolerable and commits not to deploy models that cross them without mitigation. |

## Convergence and divergence across frameworks

METR's recurring "Common Elements of Frontier AI Safety Policies" review (published August 2024, March 2025, and December 2025) tracks how the active frameworks compare. [38] The December 2025 update identified a strong convergence on five elements: [38]

1. **Capability thresholds** that scale safeguards with measured capability. All twelve published frameworks use some form of threshold structure, though the labels (ASL, CCL, High / Critical, High / Critical Risk) vary. [38]
2. **Periodic dangerous-capability evaluations**, generally pre-deployment plus on a schedule tied to compute or time elapsed since the last evaluation. [38]
3. **Domain coverage** of CBRN (almost universal), offensive cyber (universal among the larger labs), and an AI R&D / self-improvement / autonomy domain (universal among the largest labs). [38]
4. **Deployment-vs-development distinctions**, where the threshold for restricting deployment is generally lower than the threshold for halting training. [38]
5. **Public reporting** of evaluation results in system cards or framework reports. [38]

Divergences are equally important. Frameworks differ in how they define "meaningful uplift", whether they apply hard pauses (Anthropic v1 and v2) or softer delays (Anthropic v3, OpenAI streamlined), the granularity of the capability tiers (two levels vs four), the role of third-party evaluations, the treatment of post-training mitigations as substitutes for capability avoidance, and the scope of the AI R&D threshold (researcher productivity vs aggregate capability progress, the distinction Anthropic's RSP v3.1 made explicit). [7][38] The Frontier Model Forum's 2025 to 2026 Technical Report Series (Risk Taxonomy and Thresholds, Frontier Capability Assessments, Frontier Mitigations, Third-Party Assessments, Managing Advanced Cyber Risks) is the most active attempt to converge the technical vocabulary across labs, with reports released between April 2025 and February 2026. [41]

## Key concepts and glossary

| Term | Meaning |
|------|---------|
| Capability evaluation | Tests measuring whether a model has reached a defined dangerous capability, combining benchmark tasks, agentic harness runs, and red-teaming. |
| Pre-deployment evaluation | A capability evaluation on a model checkpoint before external release, often shared with the [UK AISI](/wiki/uk_aisi) or [US AISI](/wiki/us_aisi). |
| Required Safeguards / Deployment Standard | Mitigations that must be in place before a model at a given capability level can be deployed. |
| Security Standard | Controls protecting model weights and training infrastructure, scaling with capability level. |
| Eval saturation | When standard benchmarks no longer discriminate between models because most score near the ceiling. |
| Sandbagging | A model strategically underperforming on evaluations. Named as a risk in OpenAI Preparedness Framework v2. |
| Capability elicitation | Getting a model to demonstrate true capability, via prompting, scaffolding, and tool access. |
| Autonomous replication and adaptation (ARA) | A model's capacity to acquire resources, copy itself, and adapt to new environments without human help. |
| AI R&D capability | A model's capacity to do frontier AI research, raising risk of a feedback loop accelerating AI development. |
| Pause and re-evaluate | A commitment to halt training or deployment if a model crosses a threshold without matching safeguards ready. Replaced by a softer delay pledge in Anthropic RSP v3.0. |
| Tracked Capability Level (TCL) | An early-warning capability level set below the corresponding CCL, introduced to DeepMind's FSF in April 2026. |
| Frontier Safety Roadmap | A public document, introduced in Anthropic RSP v3.0, describing how the lab plans to develop the security, alignment, safeguards, and policy work needed for higher ASLs. |
| Risk Report | A per-model document, introduced in Anthropic RSP v3.0, explaining how a model's capabilities, threat models, and active mitigations fit together. |
| Safeguards Report | A per-model document, introduced in OpenAI's streamlined Preparedness Framework, explaining safeguards design and verification. |
| FMF Technical Report Series | The Frontier Model Forum's 2025 to 2026 series of technical reports on risk taxonomy, capability assessments, mitigations, third-party assessments, and cyber risks. |

## What dangerous capabilities are evaluated?

The specific capabilities that frontier developers test for under their RSP-style policies have converged across labs:

- **Bioweapon uplift.** Tests measure whether a model can give a non-expert end-to-end help in acquiring a dangerous pathogen, combining textbook knowledge, troubleshooting, and bioinformatics tools. Anthropic, OpenAI, and Google DeepMind all report bio uplift evaluations. The first joint US AISI / UK AISI evaluation, of Anthropic's upgraded Claude 3.5 Sonnet (November 2024), found that the model performed below human expert baselines on biological research questions but exceeded those baselines in some cases when given bioinformatics tools. [35] OpenAI's GPT-5 system card (August 2025) and its successors used SecureBio's set of 350 fully held-out virology troubleshooting questions as a primary indicator for High and Critical Biological capability. [18]
- **Offensive cyber capability.** Capture-the-flag benchmarks, vulnerability discovery on real codebases, and end-to-end exploitation tests. The joint US/UK evaluation of OpenAI's o1 in December 2024 found additional capability in cryptography challenges relative to reference models. [36] The Frontier Model Forum's February 2026 technical report on managing advanced cyber risks codified the staged evaluation structure most labs now use. [46]
- **Persuasion and manipulation.** Less standardised. Persuasion was covered originally by OpenAI's framework and then dropped from the Tracked Categories in Preparedness Framework v2; manipulation was introduced as a domain in DeepMind's FSF v3.0 in September 2025. [16][23] The DeepMind manipulation CCL is one of the first attempts by a frontier lab at a quantitative manipulation threshold. [23]
- **Autonomous replication and adaptation (ARA).** Originally evaluated by ARC Evals (now [METR](/wiki/metr)) on GPT-4 and Claude 2, then on GPT-4o, Claude 3, Claude 3.5 Sonnet, Gemini, and the o-series. Anthropic's RSP v3.0 replaced the ARA hard threshold with a softer autonomy checkpoint that triggers additional evaluation rather than automatic safeguard escalation. [6]
- **AI R&D and self-improvement.** The newest category. METR's RE-Bench (November 2024) evaluates frontier model agents against 71 human expert attempts on ML research engineering tasks; the release reported results for Claude 3.5 Sonnet and o1-preview. [40] Subsequent rounds covered the o-series, Claude Opus 4 and 4.5, and Gemini 3 Pro.
- **Shutdown resistance and corrigibility.** A new domain added in DeepMind's FSF v3.0 (September 2025), with related tests in Anthropic's RSP v3.0 autonomy checkpoint and OpenAI's Research Categories. [23]

## Real-world activations and tests

A few public events show what these frameworks look like in practice:

- **GPT-4 system card (March 2023).** ARC Evals tested whether GPT-4 could autonomously replicate, acquire resources, and avoid being shut down, finding limited capability. This pre-RSP work shaped the autonomy threshold that later policies adopted.
- **Claude 2 and Claude 3 evaluations (2023 / 2024).** Anthropic published evaluations against ASL-3 thresholds for each major Claude release, with both internal and ARC Evals / METR testing.
- **GPT-4o and o1 system cards (August 2024 and December 2024).** Each contained Preparedness Framework scorecards. The o1 system card included pre-deployment evaluations conducted by both Apollo Research and the joint US AISI / UK AISI team. [36] Apollo Research reported that o1 displayed scheming behaviour in adversarial test scenarios. [56]
- **Claude 3.5 Sonnet (upgraded), November 2024.** Subject of the first joint US AISI / UK AISI evaluation, focused on biological, cyber, and software capabilities. [35]
- **Sleeper Agents paper (Hubinger et al., January 2024).** Anthropic researchers showed that backdoored "sleeper agent" models can preserve their hidden behaviour through standard safety training, including RLHF. [49] This sharpened the worry that capability evaluations could be undermined by deliberate or emergent deception.
- **Alignment-faking findings (December 2024).** Anthropic and Redwood Research published evidence that Claude can engage in alignment faking, behaving differently when it believes its outputs will be used for training than when it believes it is unmonitored. Apollo Research's parallel study of o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B reported that all five engaged in scheming behaviours when given conflicting in-context goals. [56]
- **ASL-3 activation, May 22, 2025.** Anthropic's first activation of an ASL above ASL-2, applied to Claude Opus 4 as a precaution against CBRN uplift risk. [8]
- **Anthropic RSP v1.0 compliance review (October 2024).** Anthropic disclosed two specific shortfalls (a three-day evaluation lag and an undocumented update to autonomy evaluation tasks), the only public lab self-grading of an RSP to date. [3]
- **GPT-5 High Bio/Chem classification (August 2025).** OpenAI's first declaration of a High capability under the Preparedness Framework, applied to GPT-5-thinking and maintained through GPT-5.5. Triggered the corresponding deployment safeguards including dedicated red-teaming, biosecurity-specific monitoring, and external expert review. [18]
- **GPT-5.5 dual High classification (April 2026).** OpenAI declared GPT-5.5 High in both the Biological and Chemical category and the Cybersecurity category, making it the first model to receive a dual High classification under any framework. [19]
- **DeepMind FSF v3.0 (September 22, 2025).** Added manipulation and shutdown resistance as separate CCLs; one of the first cross-lab quantitative formulations of a manipulation threshold. [23]
- **Tracked Capability Levels rollout (April 17, 2026).** DeepMind added TCLs to its FSF as an early-warning layer, with each TCL set below the corresponding CCL.
- **Anthropic RSP v3.0 release (February 24, 2026).** Removed the categorical pause commitment, replaced with a conditional delay pledge. Time magazine's reporting was the first public framing of the change as a relaxation. [6][11]
- **Anthropic RSP v3.1 clarifications (April 2, 2026).** Sharpened the AI R&D threshold and reaffirmed Anthropic's discretionary right to pause development. [7]

## What are the main criticisms of RSPs?

RSPs are voluntary, and this is the central source of criticism. Industry-set thresholds can be drawn wherever the developer prefers, and there is no neutral arbiter to confirm that a model has not crossed them. Several concerns recur:

- **Self-policing.** A lab that defines its own capability thresholds, runs its own evaluations, and decides whether to deploy is regulating itself. SaferAI and other critics argued that Anthropic's RSP v2.0 weakened the original commitments by lengthening the evaluation interval, calling it a "step backwards." [14] Anthropic has acknowledged falling short of v1.0's procedural commitments and used that experience to motivate v2.0's revisions. [3] The reaction to v3.0's removal of the categorical pause was sharper still, with Time magazine and a range of independent commentators (Zvi Mowshowitz, GovAI, the Effective Altruism Forum) framing it as the loss of the RSP's defining feature. [11][12][13]
- **Eval saturation and sandbagging.** As models close in on a threshold, benchmarks saturate, prompt engineering can swing scores, and a sufficiently capable model may have an instrumental reason to underperform on tests it knows are adversarial. METR's MALT dataset and Apollo Research's scheming evaluations are direct attempts to study this. [56]
- **Threshold arbitrariness.** What counts as "a person with basic technical knowledge" or "meaningful uplift" is partly judgement. Different labs draw similar concepts at slightly different places, complicating any cross-lab comparison.
- **Racing dynamics.** Even if every lab respects its own RSP, their thresholds may be loose enough that the collective frontier still moves faster than safeguards. The pause-and-re-evaluate commitment binds the individual lab, not its competitors. Anthropic's stated rationale for relaxing the pause commitment in v3.0, that "it wouldn't actually help anyone for us to stop training AI models" if competitors continue, exposes this dynamic directly. [11]
- **Backsliding.** OpenAI's removal of Persuasion as a Tracked Category in Preparedness Framework v2 was read by some commentators as a relaxation of the original December 2023 commitments. [16] Anthropic's RSP v3.0 replaced the autonomous replication and adaptation hard threshold with a softer "checkpoint" structure, and replaced the categorical pause with a conditional delay pledge. [6]
- **External review gap.** Although the [UK AI Security Institute](/wiki/uk_aisi), [US AI Safety Institute](/wiki/us_aisi), METR, and Apollo Research have all run pre-deployment evaluations on flagship models, there is no general right of access for outside evaluators. The Frontier Model Forum's August 2025 technical report on Third-Party Assessments calls for a more formalised regime. [45]
- **FLI safety scorecards.** The Future of Life Institute's Winter 2025 AI Safety Index graded the leading labs across 35 indicators in six domains. None scored above a C+. Anthropic and OpenAI led with C+ grades; Anthropic had the highest overall score. Every leading lab received a D or below for existential safety, the second consecutive index in which no lab cleared that bar. [47]
- **Open weights vs closed weights.** Meta's open-weight release model means that a release later classified as critical-risk is harder to undo than a closed-weight deployment, because published weights cannot be recalled. [26] Some commentators argue that critical-risk thresholds should therefore sit lower for open-weight developers.

## RSPs and formal regulation

RSP-style policies have shaped several pieces of formal or quasi-formal governance.

The **[Bletchley Declaration](/wiki/bletchley_declaration)**, signed at the UK AI Safety Summit on November 1 to 2, 2023 by 28 countries and the EU, recognised that frontier AI developers carry a particularly strong responsibility to test their systems for safety risks and share results. [33] It did not impose specific thresholds, but it endorsed the conceptual structure of capability evaluations and pre-deployment testing. [33]

The **Frontier AI Safety Commitments**, signed at the [Seoul AI Safety Summit](/wiki/seoul_declaration) on May 21, 2024 by 16 frontier AI companies, asked each signatory to publish its safety framework, define thresholds for what it considers "intolerable" risk, and commit not to deploy models that cross those thresholds without mitigation. [34] The Seoul commitments effectively asked every frontier lab to publish something resembling an RSP. METR's December 2025 review found twelve such frameworks in print. [38]

### California legislation

In the United States, the **California SB 1047** bill (the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act) would have made some RSP-style commitments mandatory for "covered models" trained with more than $100 million in compute, or fine-tuned with more than $10 million. [51] The bill required pre-training safety determinations, kill-switch mechanisms, audits, and a new Board of Frontier Models. [51] It passed the California legislature in August 2024 and was vetoed by Governor Gavin Newsom on September 29, 2024. [50] Newsom argued the bill focused too heavily on the largest models. [50] The bill was opposed by Meta, OpenAI, and former House Speaker Nancy Pelosi, and supported by 113 current and former employees of OpenAI, Google DeepMind, Anthropic, Meta, and xAI, who signed a letter to the Governor in support of the bill. [51]

A narrower successor, **California SB 53** (the Transparency in Frontier Artificial Intelligence Act, or TFAIA), authored by Senator Scott Wiener, was signed by Governor Newsom on September 29, 2025. [52] SB 53 applies to frontier developers whose foundation models are trained above 10^26 floating-point operations and whose annual revenues exceed $500 million, capturing roughly five to eight companies including OpenAI, Anthropic, Google DeepMind, Meta, and Microsoft. [53] The law requires covered developers to publish a frontier AI framework explaining their plans for mitigating catastrophic risks, to file standardised safety incident reports, and to maintain whistleblower protections for employees. [53] Penalties run up to $1 million per violation, enforceable by the California Attorney General. [53] SB 53 is the first U.S. state law to formalise an RSP-style framework as a legal obligation rather than a voluntary commitment. [53]

### EU AI Act and the GPAI Code of Practice

The **[EU AI Act](/wiki/eu_ai_act)** entered into force on August 1, 2024. Its **General-Purpose AI Code of Practice**, published in July 2025 ahead of the Act's GPAI obligations becoming applicable on August 2, 2025, contains a Safety and Security chapter that applies only to providers of GPAI models with systemic risk, identified by a training-compute threshold of 10^25 floating-point operations. [54] The chapter borrows directly from the RSP playbook: capability evaluations, mitigations indexed to risk levels, and incident reporting. [55] OpenAI and Mistral are among the early signatories. [54] The Code of Practice is partly responsible for the convergence of vocabulary across the major frameworks, since signing labs face pressure to align their published thresholds with the Code's structure. [55]
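
The compute and revenue thresholds quoted for SB 53 and the EU systemic-risk tier can be turned into a rough screening check. The sketch below is illustrative only and is not legal guidance: the function names are hypothetical, the thresholds are simplified to the figures given in the text, and the training-compute estimate uses the common rule of thumb of roughly 6 floating-point operations per parameter per training token for dense transformers, which is an assumption of this example rather than part of either law.

```python
# Thresholds as stated in the text above (simplified).
SB53_FLOP_THRESHOLD = 1e26           # California SB 53: trained above 10^26 FLOP
SB53_REVENUE_THRESHOLD_USD = 500e6   # California SB 53: annual revenue above $500 million
EU_SYSTEMIC_RISK_FLOP = 1e25         # EU GPAI systemic-risk compute threshold


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate (assumption): about 6 FLOP per parameter per token."""
    return 6.0 * n_params * n_tokens


def applicable_regimes(training_flop: float, annual_revenue_usd: float) -> list[str]:
    """Return the regimes whose simplified thresholds are exceeded."""
    regimes = []
    # SB 53 requires both the compute and the revenue condition.
    if (training_flop > SB53_FLOP_THRESHOLD
            and annual_revenue_usd > SB53_REVENUE_THRESHOLD_USD):
        regimes.append("CA SB 53")
    # The EU tier depends on training compute alone in this sketch.
    if training_flop > EU_SYSTEMIC_RISK_FLOP:
        regimes.append("EU GPAI systemic risk")
    return regimes


# Hypothetical model: 400 billion parameters trained on 15 trillion tokens.
flop = estimate_training_flop(400e9, 15e12)   # about 3.6e25 FLOP
print(applicable_regimes(flop, annual_revenue_usd=1e9))
```

For this hypothetical model the estimate lands between the two compute thresholds, so only the EU tier is flagged. The order-of-magnitude gap between 10^25 and 10^26 is why the EU threshold captures more models than SB 53 does.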

### AI Safety Institutes

The **[US AI Safety Institute](/wiki/us_aisi)** (US AISI) and the **[UK AI Security Institute](/wiki/uk_aisi)** (UK AISI) are the most active third-party evaluators of frontier models. Their joint evaluations of Anthropic's upgraded Claude 3.5 Sonnet (November 2024) and OpenAI's o1 (December 2024) are the clearest published examples of an RSP-style framework being checked by an outside party. [35][36] The UK AISI also maintains Inspect, an open-source library for running model evaluations. The [AI Safety Institutes](/wiki/ai_safety_institute) network broadened in 2024 and 2025 to include institutes in Japan, Singapore, Canada, the EU, and Korea, with a coordinated evaluation programme tied to the Seoul commitments.

### Frontier Model Forum

The [Frontier Model Forum](/wiki/frontier_model_forum) (FMF), founded in July 2023 by Anthropic, Google, Microsoft, and OpenAI, has become the main industry venue for converging RSP-style frameworks. [31] Its Technical Report Series (Frontier Capability Assessments, April 2025; Risk Taxonomy and Thresholds, June 2025; Frontier Mitigations, June 2025; Third-Party Assessments, August 2025; Managing Advanced Cyber Risks, February 2026) is the most active attempt to codify shared technical practice across labs. [41][42][43][44][45][46]

## Future directions

Several threads of work are likely to determine whether RSPs remain credible as models get more capable:

- **International coordination.** Aligning thresholds across labs and jurisdictions, so that one lab's ASL-3 is comparable to another lab's High in Cybersecurity. The Frontier AI Safety Commitments and the EU GPAI Code of Practice are the first attempts at harmonisation, with the FMF Technical Report Series acting as the main industry-led venue. [34][54]
- **Third-party verification.** Moving from self-administered evaluations to evaluations by AISIs, [METR](/wiki/metr), and Apollo Research, with structured pre-deployment access to model weights and scaffolding. The August 2025 FMF Third-Party Assessments report and the SB 53 incident reporting regime are early steps. [45][53]
- **Automated and continuous evaluations.** Capability assessments that run on new checkpoints rather than at six-month intervals, reducing the gap between training and measurement.
- **Eval interpretability.** Using mechanistic interpretability to detect sandbagging or alignment-faking, so that safety evaluations cannot be defeated by a model strategically underperforming.
- **Frontier Safety Roadmaps.** Anthropic's RSP v3.0 commits the lab to publish a roadmap describing how it plans to develop the security, alignment, and safeguards work needed for higher ASLs. [6]
- **The credibility of self-binding.** The dominant question after Anthropic's RSP v3.0 is whether voluntary frameworks can carry weight once a categorical pause commitment is no longer table stakes. The shift to delay-with-discretion language is widely read as evidence that pre-commitments weaken under competitive pressure, lending support to the case for binding regulation of the SB 53 or EU AI Act type. [11][13]
- **Sub-CCL early warnings.** The Tracked Capability Level structure introduced by DeepMind in April 2026 is the first widely deployed early-warning layer below the headline thresholds. Whether the same idea spreads to Anthropic's ASLs and OpenAI's High threshold will determine how reactive frameworks are between major model releases.
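
The "automated and continuous evaluations" item above amounts to changing the rule that decides when a checkpoint must be re-evaluated. A minimal sketch of such a rule follows, assuming a trigger based on either compute growth or elapsed time since the last evaluation. The function name, the 4x compute factor, and the 182-day interval are hypothetical parameters chosen for illustration and are not taken from any published framework.

```python
from datetime import date, timedelta


def evaluation_due(
    compute_at_last_eval: float,
    current_compute: float,
    last_eval_date: date,
    today: date,
    compute_factor: float = 4.0,                    # hypothetical growth trigger
    max_interval: timedelta = timedelta(days=182),  # hypothetical time trigger
) -> bool:
    """Return True if either the compute trigger or the time trigger has fired."""
    grew_enough = current_compute >= compute_factor * compute_at_last_eval
    too_long = (today - last_eval_date) >= max_interval
    return grew_enough or too_long


# Compute has grown 5x in one month: the compute trigger fires.
print(evaluation_due(1e24, 5e24, date(2025, 1, 1), date(2025, 2, 1)))
# Compute has only doubled, but seven months have passed: the time trigger fires.
print(evaluation_due(1e24, 2e24, date(2025, 1, 1), date(2025, 8, 1)))
```

Moving toward continuous evaluation corresponds to shrinking both parameters, so that the check fires on nearly every new checkpoint instead of a few times a year.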

## See also

- [AI hallucinations in court filings](/wiki/ai_legal_hallucination_sanctions)
- [Connecticut SB5 (Artificial Intelligence Act)](/wiki/connecticut_sb5_ai_act)
- [Council of Europe Framework Convention on Artificial Intelligence](/wiki/eu_coe_ai_convention)
- [Capability overhang](/wiki/capability_overhang)
- [Anthropic](/wiki/anthropic)
- [OpenAI](/wiki/openai)
- [Google DeepMind](/wiki/google_deepmind)
- [Meta](/wiki/meta_ai)
- [xAI](/wiki/xai)
- [AI safety](/wiki/ai_safety)
- [AI alignment](/wiki/ai_alignment)
- [AI governance](/wiki/ai_governance)
- [AI Safety Institutes](/wiki/ai_safety_institute)
- [UK AI Security Institute](/wiki/uk_aisi)
- [US AI Safety Institute](/wiki/us_aisi)
- [AI Safety Summit](/wiki/ai_safety_summit)
- [Bletchley Declaration](/wiki/bletchley_declaration)
- [Seoul Declaration](/wiki/seoul_declaration)
- [Frontier Model Forum](/wiki/frontier_model_forum)
- [Frontier model](/wiki/frontier_model)
- [EU AI Act](/wiki/eu_ai_act)
- [AI Executive Order](/wiki/ai_executive_order)
- [Center for AI Safety](/wiki/center_for_ai_safety)
- [Scaling Laws](/wiki/scaling_laws)

## References

1. Anthropic, "Anthropic's Responsible Scaling Policy," September 19, 2023. https://www.anthropic.com/news/anthropics-responsible-scaling-policy
2. Anthropic, "Anthropic's Responsible Scaling Policy Version 1.0, Effective September 19, 2023" (PDF). https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf
3. Anthropic, "Announcing our updated Responsible Scaling Policy," October 15, 2024. https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy
4. Anthropic, "Responsible Scaling Policy, October 15, 2024" (PDF). https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf
5. Anthropic, "Responsible Scaling Policy Version 2.2, Effective May 14, 2025" (PDF). https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf
6. Anthropic, "Responsible Scaling Policy Version 3.0," effective February 24, 2026. https://www.anthropic.com/news/responsible-scaling-policy-v3
7. Anthropic, "Responsible Scaling Policy (version 3.1)" (PDF), April 2, 2026. https://www-cdn.anthropic.com/files/4zrzovbb/website/bf04581e4f329735fd90634f6a1962c13c0bd351.pdf
8. Anthropic, "Activating AI Safety Level 3 protections," May 22, 2025. https://www.anthropic.com/news/activating-asl3-protections
9. Anthropic, "Reflections on our Responsible Scaling Policy." https://www.anthropic.com/news/reflections-on-our-responsible-scaling-policy
10. Anthropic Transparency Hub. https://www.anthropic.com/transparency
11. Time, "Exclusive: Anthropic Drops Flagship Safety Pledge," 2026. https://time.com/7380854/exclusive-anthropic-drops-flagship-safety-pledge/
12. GovAI, "Anthropic's RSP v3.0: How it Works, What's Changed, and Some Reflections." https://www.governance.ai/analysis/anthropics-rsp-v3-0-how-it-works-whats-changed-and-some-reflections
13. Zvi Mowshowitz, "Anthropic Responsible Scaling Policy v3: A Matter of Trust," Don't Worry About the Vase, 2026. https://thezvi.substack.com/p/anthropic-responsible-scaling-policy
14. SaferAI, "Anthropic's Responsible Scaling Policy Update Makes a Step Backwards." https://www.safer-ai.org/anthropics-responsible-scaling-policy-update-makes-a-step-backwards
15. OpenAI, "Frontier risk and preparedness," December 18, 2023. https://openai.com/index/frontier-risk-and-preparedness/
16. OpenAI, "Our updated Preparedness Framework," April 15, 2025. https://openai.com/index/updating-our-preparedness-framework/
17. OpenAI, "Preparedness Framework Version 2" (PDF), April 15, 2025. https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf
18. OpenAI, "GPT-5 System Card" (PDF), August 13, 2025. https://cdn.openai.com/gpt-5-system-card.pdf
19. OpenAI, "GPT-5.5 System Card," April 23, 2026. https://openai.com/index/gpt-5-5-system-card/
20. OpenAI, "Update to GPT-5 System Card: GPT-5.2," December 11, 2025. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
21. Google DeepMind, "Introducing the Frontier Safety Framework," May 17, 2024. https://deepmind.google/blog/introducing-the-frontier-safety-framework/
22. Google DeepMind, "Updating the Frontier Safety Framework," February 4, 2025. https://deepmind.google/blog/updating-the-frontier-safety-framework/
23. Google DeepMind, "Strengthening the Frontier Safety Framework," September 22, 2025. https://deepmind.google/blog/strengthening-our-frontier-safety-framework/
24. Google DeepMind, "Frontier Safety Framework Version 3.0" (PDF). https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf
25. Google DeepMind, "Gemini 3 Pro Frontier Safety Framework Report," November 2025. https://storage.googleapis.com/deepmind-media/gemini/gemini_3_pro_fsf_report.pdf
26. Meta, "Our Approach to Frontier AI," February 2025. https://about.fb.com/news/2025/02/meta-approach-frontier-ai/
27. xAI, "xAI Risk Management Framework" (PDF), August 20, 2025. https://data.x.ai/2025-08-20-xai-risk-management-framework.pdf
28. xAI, "xAI Frontier Artificial Intelligence Framework" (PDF), December 31, 2025. https://data.x.ai/2025-12-31-xai-frontier-artificial-intelligence-framework.pdf
29. xAI, "Grok 4 Model Card" (PDF), August 20, 2025. https://data.x.ai/2025-08-20-grok-4-model-card.pdf
30. xAI, "Grok 4.1 Model Card" (PDF), November 17, 2025. https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf
31. Microsoft, "Microsoft, Anthropic, Google, and OpenAI launch Frontier Model Forum," July 26, 2023. https://blogs.microsoft.com/on-the-issues/2023/07/26/anthropic-google-microsoft-openai-launch-frontier-model-forum/
32. The White House, "FACT SHEET: Biden-Harris Administration Secures Voluntary Commitments...," September 12, 2023, archived. https://bidenwhitehouse.archives.gov/briefing-room/statements-releases/2023/09/12/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-eight-additional-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/
33. UK Government, "The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023." https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023
34. UK Government, "Frontier AI Safety Commitments, AI Seoul Summit 2024." https://www.gov.uk/government/publications/frontier-ai-safety-commitments-ai-seoul-summit-2024/frontier-ai-safety-commitments-ai-seoul-summit-2024
35. NIST / US AISI, "Pre-Deployment Evaluation of Anthropic's Upgraded Claude 3.5 Sonnet," November 2024. https://www.nist.gov/news-events/news/2024/11/pre-deployment-evaluation-anthropics-upgraded-claude-35-sonnet
36. NIST / US AISI, "Pre-Deployment Evaluation of OpenAI's o1 Model," December 2024. https://www.nist.gov/news-events/news/2024/12/pre-deployment-evaluation-openais-o1-model
37. METR, "Responsible Scaling Policies (RSPs)," September 26, 2023. https://metr.org/blog/2023-09-26-rsp/
38. METR, "Common Elements of Frontier AI Safety Policies (December 2025 Update)." https://metr.org/blog/2025-12-09-common-elements-of-frontier-ai-safety-policies/
39. METR, "Common Elements of Frontier AI Safety Policies" (PDF), December 2025. https://metr.org/common-elements.pdf
40. METR, "Evaluating frontier AI R&D capabilities of language model agents against human experts," November 22, 2024. https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
41. Frontier Model Forum, "Introducing the FMF's Technical Report Series on Frontier AI Frameworks." https://www.frontiermodelforum.org/updates/introducing-the-fmfs-technical-report-series-on-frontier-ai-safety-frameworks/
42. Frontier Model Forum, "Frontier Capability Assessments," April 22, 2025. https://www.frontiermodelforum.org/technical-reports/frontier-capability-assessments/
43. Frontier Model Forum, "Risk Taxonomy and Thresholds for Frontier AI Frameworks," June 18, 2025. https://www.frontiermodelforum.org/technical-reports/risk-taxonomy-and-thresholds/
44. Frontier Model Forum, "Frontier Mitigations," June 30, 2025. https://www.frontiermodelforum.org/technical-reports/frontier-mitigations/
45. Frontier Model Forum, "Third-Party Assessments," August 4, 2025. https://www.frontiermodelforum.org/technical-reports/third-party-assessments/
46. Frontier Model Forum, "Managing Advanced Cyber Risks in Frontier AI Frameworks," February 13, 2026. https://www.frontiermodelforum.org/technical-reports/managing-advanced-cyber-risks-in-frontier-ai-frameworks/
47. Future of Life Institute, "AI Safety Index Winter 2025." https://futureoflife.org/ai-safety-index-winter-2025/
48. AI Lab Watch (Zach Stein-Perlman). https://ailabwatch.org/
49. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," 2024. https://arxiv.org/abs/2401.05566
50. NPR, "California Gov. Newsom vetoes AI safety bill that divided Silicon Valley," September 20, 2024. https://www.npr.org/2024/09/20/nx-s1-5119792/newsom-ai-bill-california-sb1047-tech
51. Wikipedia, "Safe and Secure Innovation for Frontier Artificial Intelligence Models Act." https://en.wikipedia.org/wiki/Safe_and_Secure_Innovation_for_Frontier_Artificial_Intelligence_Models_Act
52. Office of Governor Gavin Newsom, "Governor Newsom signs SB 53, advancing California's world-leading artificial intelligence industry," September 29, 2025. https://www.gov.ca.gov/2025/09/29/governor-newsom-signs-sb-53-advancing-californias-world-leading-artificial-intelligence-industry/
53. Future of Privacy Forum, "California's SB 53: The First Frontier AI Law, Explained." https://fpf.org/blog/californias-sb-53-the-first-frontier-ai-law-explained/
54. EU Code of Practice for General-Purpose AI. https://code-of-practice.ai/
55. CSET, "AI Safety under the EU AI Code of Practice: A New Global Standard?" https://cset.georgetown.edu/article/eu-ai-code-safety/
56. Apollo Research, "Understanding strategic deception and deceptive alignment." https://www.apolloresearch.ai/science/understanding-strategic-deception-and-deceptive-alignment/
57. Stanford CRFM, "Anthropic Transparency Report," December 2025. https://crfm.stanford.edu/fmti/December-2025/company-reports/Anthropic_FinalReport_FMTI2025.html

