Responsible Scaling Policy
Last reviewed
May 7, 2026
Sources
57 citations
Review status
Source-backed
Revision
v2 · 7,801 words
A Responsible Scaling Policy (RSP) is a self-imposed governance framework in which an AI lab commits in advance to safety practices, capability evaluations, deployment restrictions, and security measures that scale with the capability of its frontier models. The framework is triggered when models cross defined capability thresholds, often called "AI Safety Levels" (ASLs), "Critical Capability Levels" (CCLs), or numerically graded risk tiers. A developer publicly pre-commits to safeguards before its models reach dangerous capability levels, so decisions about training or deployment are constrained by rules written before commercial pressure to ship arrives.
The first RSP was published by [[anthropic|Anthropic]] on September 19, 2023. The label remains specific to Anthropic, but other frontier developers have since published functionally similar documents: [[openai|OpenAI]]'s Preparedness Framework (December 18, 2023, with v2 in April 2025), [[google_deepmind|Google DeepMind]]'s Frontier Safety Framework (May 17, 2024, with later versions in February 2025, September 2025, and April 2026), and [[meta_ai|Meta]]'s Frontier AI Framework (February 2025). RSP-style policies are now a standard component of [[ai_governance|AI governance]] and are referenced in the [[eu_ai_act|EU AI Act]]'s General-Purpose AI Code of Practice (August 2025), the [[bletchley_declaration|Bletchley Declaration]] (November 2023), and the Frontier AI Safety Commitments signed at the [[seoul_declaration|Seoul AI Safety Summit]] (May 2024). By December 2025, [[metr|METR]] counted twelve frontier developers with published frameworks, including Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA.
A Responsible Scaling Policy has four standard ingredients:
The policy's bite comes from the conditional structure: a developer commits in advance that, if a future evaluation shows a model crossing a threshold and the matching safeguards are not in place, it will halt deployment or training rather than ship the model. This removes the live commercial decision from the moment of measurement and replaces it with a rule written months earlier. As discussed below, the strength of that pre-commitment has weakened in some recent revisions.
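The conditional structure described above can be sketched as a small decision rule. This is a minimal illustration, not any lab's actual procedure: the names `EvaluationResult`, `Decision`, and `precommitted_decision` are hypothetical, and real policies define thresholds and safeguards qualitatively rather than as two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    HALT = "halt"


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical record of one capability evaluation (an assumption for illustration)."""
    threshold_crossed: bool    # did the model cross a defined capability threshold?
    safeguards_in_place: bool  # are the matching safeguards already implemented?


def precommitted_decision(result: EvaluationResult) -> Decision:
    """Apply the rule written in advance, with no case-by-case judgment at measurement time."""
    if result.threshold_crossed and not result.safeguards_in_place:
        return Decision.HALT
    return Decision.PROCEED


if __name__ == "__main__":
    # A model that crosses a threshold before its safeguards are ready is halted.
    print(precommitted_decision(EvaluationResult(True, False)))
```

The point of the sketch is that the function takes only the measurement as input: commercial considerations at the time of evaluation are not parameters of the rule.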
Anthropic published RSP v1.0 on September 19, 2023. The document defined four AI Safety Levels modelled, in Anthropic's framing, on the U.S. biosafety level (BSL) standards for dangerous biological materials. Higher ASLs require increasingly strict demonstrations of safety before models can be developed or deployed.
The original ASL definitions:
| Level | Description | Examples |
|---|---|---|
| ASL-1 | Systems that pose no meaningful catastrophic risk. | A 2018 LLM, or an AI system that only plays chess. |
| ASL-2 | Systems that show early signs of dangerous capabilities (such as describing how to build bioweapons) but where the information is not yet useful in practice due to insufficient reliability or because it is not meaningfully better than what a determined searcher could find. | Frontier LLMs at the time of publication, including [[claude |
| ASL-3 | Systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (such as a search engine or a textbook), or that show low-level autonomous capabilities. | Not yet reached at the time of v1.0. |
| ASL-4 and higher | Reserved for future revisions; involves qualitative escalations in catastrophic misuse potential and autonomy. | Speculative. |
For each ASL, Anthropic specified a "Required Safeguards" set: a deployment standard governing external use, and a security standard for weights and training infrastructure. ASL-3 added enhanced internal access controls on weights, hardened deployment with abuse monitoring, and red-teaming designed to demonstrate the model could not provide meaningful uplift to a CBRN attacker.
v1.0 also formalised governance practices: board approval for any policy change, a designated Responsible Scaling Officer, public reporting of evaluation results, and a commitment to pause scaling if the safeguards for a model's capability level were not yet ready.
Anthropic published RSP v2.0 on October 15, 2024. The update kept the ASL structure but reorganised the policy around named Capability Thresholds that, when crossed, place a model under the corresponding ASL safeguards. The two thresholds defined explicitly in v2.0 are:
v2.0 clarified that ASL labels refer to groups of safeguards, not to models; a model can be operating under ASL-2 standards while approaching the CBRN-3 threshold. The evaluation cadence was lengthened from roughly three months to six months, on the rationale that the previous interval did not leave enough time for high-quality capability elicitation. Anthropic disclosed that its first round of internal compliance had fallen short of v1.0's procedural commitments; this motivated the rewrite.
Jared Kaplan, Anthropic's chief science officer, took on the role of Responsible Scaling Officer at the v2.0 transition, with a separate Head of Responsible Scaling position created for day-to-day work.
Alongside the v2.0 release, Anthropic conducted an internal review of its compliance with v1.0 and disclosed two specific shortfalls:
Anthropic concluded the lapses posed minimal risk in the specific cases, but cited them as evidence that a tight three-month cadence and rigid evaluation suite had become brittle. The episode is one of the few public accounts of an RSP signatory grading itself against its own document. Critics including SaferAI argued the response weakened the document overall.
Minor updates were issued as v2.1 (March 31, 2025) and v2.2 (May 14, 2025). v2.1 clarified the procedure for evaluating ASL-3 deployment safeguards, including how the criterion of "sufficiently minimising the residual risk" should be interpreted. v2.2, which took effect eight days before the [[claude|Claude]] Opus 4 launch, refined the threat models for CBRN uplift and clarified that holding back capability via post-training restrictions could substitute for raw threshold avoidance only if the restriction itself was robust to jailbreaks.
The most consequential operational event came on May 22, 2025: alongside the launch of [[claude|Claude]] Opus 4 and Claude Sonnet 4, Anthropic announced that it was activating the ASL-3 Deployment and Security Standards for Claude Opus 4. The company had not definitively confirmed Opus 4 crossed the CBRN-3 threshold, but continued improvements in CBRN-related knowledge meant ASL-3 risks could not be ruled out. This was the first real-world activation of an ASL above ASL-2.
The ASL-3 measures included "constitutional classifiers" trained to detect and refuse CBRN-related queries, more than 100 new internal security controls including egress bandwidth restrictions to slow any exfiltration of model weights, and two-party authorisation for sensitive operations on production assets. Subsequent Opus releases ([[claude_opus_4_5|Claude Opus 4.5]] in November 2025, [[claude_opus_4_6|Claude Opus 4.6]] in early 2026, and [[claude_opus_4_7|Claude Opus 4.7]]) shipped under the same ASL-3 standard. [[claude_sonnet_4_5|Claude Sonnet 4.5]] became the first Sonnet-tier model to require ASL-3, while [[claude_haiku_4_5|Claude Haiku 4.5]] launched at ASL-2 after Anthropic's evaluations placed it below the rule-out threshold for biology and autonomy.
Version 3.0 of the RSP became effective February 24, 2026. v3.0 was a structural overhaul rather than an incremental edit. It introduced a public Frontier Safety Roadmap describing concrete plans for risk mitigations across Security, Alignment, Safeguards, and Policy, and a system of Risk Reports explaining how a given model's capabilities, threat models, and active mitigations fit together. v3.0 disaggregated the AI R&D threshold into two distinct levels, added a new CBRN development capability threshold, and replaced the autonomous replication and adaptation threshold with an autonomous-capabilities "checkpoint" that triggers additional evaluation rather than automatic safeguard escalation.
The most controversial change was the removal of the categorical pause commitment. Earlier versions barred Anthropic from training or deploying a model that crossed a Capability Threshold without matching safeguards in place. v3.0 replaced that hard stop with a softer pledge to "delay" development in cases where leadership judges Anthropic to be the AI race leader and the catastrophic-risk concerns to be significant. Time magazine's exclusive on the change, headlined "Anthropic Drops Flagship Safety Pledge", framed it as a step away from the original RSP philosophy. Jared Kaplan defended the change publicly: "We felt that it wouldn't actually help anyone for us to stop training AI models" if competitors continued. Critics including Zvi Mowshowitz characterised it as the loss of the RSP's core load-bearing commitment.
Version 3.1 followed on April 2, 2026 with two clarifications added in response to reader feedback on v3.0:
The second clarification was widely read as a response to the framing of the v3.0 changes as a relaxation, although it did not restore the categorical pause commitment.
| Version | Effective | Key changes |
|---|---|---|
| 1.0 | September 19, 2023 | First RSP. Defines ASL-1 to ASL-4. Three-month evaluation cadence. |
| 2.0 | October 15, 2024 | Capability Thresholds (CBRN-3, AI R&D-4, AI R&D-5). Six-month cadence. ASL applied to safeguards, not models. Compliance review of v1.0 published. |
| 2.1 | March 31, 2025 | Clarified ASL-3 deployment criterion and "sufficiently minimising" residual risk. |
| 2.2 | May 14, 2025 | Tightened CBRN uplift threat models; clarified jailbreak-robust post-training restrictions. |
| 3.0 | February 24, 2026 | Frontier Safety Roadmap; Risk Reports; CBRN development threshold; AI R&D split into two levels; ARA replaced with autonomy checkpoint; categorical pause replaced with conditional delay pledge. |
| 3.1 | April 2, 2026 | Sharpened AI R&D threshold definition; reaffirmed Anthropic may pause development independently of RSP requirements. |
| Model | Released | ASL standard | Notes |
|---|---|---|---|
| [[claude|Claude 1]] | March 2023 | ASL-2 | |
| Claude 2 | July 2023 | ASL-2 | Evaluated by ARC Evals against early ASL-3 indicators. |
| Claude 3 family | March 2024 | ASL-2 | ARA evaluations conducted by [[metr
| Claude 3.5 Sonnet (upgraded) | October 2024 | ASL-2 | First joint US AISI / UK AISI evaluation. |
| [[claude|Claude 3.7 Sonnet]] | February 2025 | ASL-2 | |
| Claude Sonnet 4 | May 22, 2025 | ASL-2 | Released alongside Opus 4. |
| [[claude|Claude Opus 4]] | May 22, 2025 | ASL-3 | |
| Claude Opus 4.1 | August 5, 2025 | ASL-3 | Coding and tool-use refinements. |
| [[claude_sonnet_4_5|Claude Sonnet 4.5]] | September 2025 | ASL-3 | |
| [[claude_haiku_4_5|Claude Haiku 4.5]] | October 2025 | ASL-2 | |
| [[claude_opus_4_5|Claude Opus 4.5]] | November 24, 2025 | ASL-3 | |
| Claude Sonnet 4.6 | Late 2025 / early 2026 | ASL-3 | Performed at or below Opus 4.6 on automated evals. |
| [[claude_opus_4_6|Claude Opus 4.6]] | 2026 | ASL-3 | |
| [[claude_opus_4_7|Claude Opus 4.7]] | 2026 | ASL-3 | |
OpenAI published the beta of its Preparedness Framework, its first RSP-style policy, on December 18, 2023. It was authored by a newly created Preparedness team, initially led by Aleksander Madry, which was reorganised in 2024 after several departures.
The original framework tracked four risk categories, each scored Low, Medium, High, or Critical:
| Category | Examples of risk |
|---|---|
| Cybersecurity | Uplift to offensive cyber operations, automated discovery and exploitation of vulnerabilities. |
| CBRN | Chemical, biological, radiological, and nuclear weapons assistance. |
| Persuasion | Highly tailored mass persuasion or manipulation. |
| Model autonomy | Self-exfiltration, autonomous replication, autonomous economic activity. |
The deployment rules: a model at High in any category can only be deployed once mitigations bring the post-mitigation score below High; a model at Critical in any category "cannot be developed" until the score is reduced.
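The two deployment rules above amount to comparisons on an ordered scale. The sketch below is an illustration under stated assumptions, not OpenAI's implementation: the function names and the dictionary-of-scores representation are hypothetical, and the text does not say whether the development gate uses pre- or post-mitigation scores, so the example simply takes whichever scores the caller supplies.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """The four-tier scale, ordered so levels can be compared."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def may_deploy(post_mitigation: dict[str, RiskLevel]) -> bool:
    """Deployable only if every category's post-mitigation score is below High."""
    return all(level < RiskLevel.HIGH for level in post_mitigation.values())


def may_continue_development(scores: dict[str, RiskLevel]) -> bool:
    """Development is blocked while any category is scored Critical."""
    return all(level < RiskLevel.CRITICAL for level in scores.values())


if __name__ == "__main__":
    # Hypothetical scorecard: one category still at High after mitigations.
    scorecard = {
        "cybersecurity": RiskLevel.MEDIUM,
        "cbrn": RiskLevel.HIGH,
        "persuasion": RiskLevel.LOW,
        "model_autonomy": RiskLevel.LOW,
    }
    print(may_deploy(scorecard), may_continue_development(scorecard))
```

Because both gates use `all(...)`, a single category at or above the relevant level is enough to block deployment or development, which matches the "in any category" wording of the rules.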
The Preparedness Framework was updated to version 2 on April 15, 2025. v2 narrowed the Tracked Categories to Biological and Chemical, Cybersecurity, and AI Self-Improvement, and introduced separate Research Categories (Long-range Autonomy, Sandbagging, Autonomous Replication and Adaptation, Undermining Safeguards, and Nuclear and Radiological). Persuasion was moved out of the Preparedness Framework, a change some observers read as softening the original commitments. The four-tier Low/Medium/High/Critical scoring was retained.
A later refresh of the Preparedness Framework collapsed the deployment-relevant scoring to two thresholds. The Low and Medium tiers were dropped from the headline classification because they did not, on their own, trigger any specific safeguards. A model that crosses the High threshold can only be deployed after safeguards "sufficiently minimise" the associated risk; a Critical classification additionally restricts development. Each release covered by the framework now ships with a Safeguards Report documenting how each safeguard was designed and verified. The change also introduced a sharper expectation about what "sufficiently minimise" means in practice and aligned the framework's vocabulary more closely with the EU GPAI Code of Practice.
OpenAI's pre-deployment evaluations under the framework have been published in system cards for GPT-4 Turbo, GPT-4o, [[o1|o1]], o3, o4-mini, GPT-4.5, and the GPT-5 family. The most consequential public classification came with [[gpt-5|GPT-5]]: OpenAI declared GPT-5 (specifically GPT-5-thinking) High capability in the Biological and Chemical domain and activated the associated safeguards. Successive system cards ([[gpt-5.1|GPT-5.1]], GPT-5.2, GPT-5.3-Codex, GPT-5.4, GPT-5.5) maintained the High classification in Biological and Chemical. GPT-5.5 was additionally treated as High in Cybersecurity, with the safeguards mix expanded accordingly. No OpenAI model has been classified as Critical to date.
| Version | Date | Headline change |
|---|---|---|
| Beta (v1) | December 18, 2023 | Four Tracked Categories: Cyber, CBRN, Persuasion, Model autonomy. Low / Medium / High / Critical scale. |
| Version 2 | April 15, 2025 | Tracked Categories narrowed to Bio/Chem, Cyber, AI Self-Improvement. Persuasion moved out. New Research Categories. |
| Streamlined revision | 2026 | Levels collapsed to High and Critical. Safeguards Reports introduced. "Sufficiently minimise" criterion sharpened. |
Google [[google_deepmind|DeepMind]] published the first version of its Frontier Safety Framework (FSF) on May 17, 2024, with v2.0 on February 4, 2025, v3.0 on September 22, 2025, and a further iteration on April 17, 2026 introducing Tracked Capability Levels. The FSF is structured around Critical Capability Levels (CCLs), capability thresholds at which a model could pose heightened risk of severe harm absent mitigations. CCLs are derived domain-by-domain: for each, DeepMind defines the minimal capability profile that would unlock that harm.
The FSF identifies CCLs across:
The framework operates in three stages: identify capabilities that could cause severe harm, run periodic "early warning" evaluations as frontier models approach a CCL, and apply escalating deployment and security mitigations once a CCL is reached. Each major model release (Gemini 2.5 Pro, the [[gemini_3|Gemini 3]] series) is accompanied by an FSF report.
In April 2026, DeepMind added Tracked Capability Levels (TCLs) in selected domains. A TCL is set below the corresponding CCL and is meant to flag less extreme but still concerning capabilities earlier in the development cycle, allowing safeguard work to begin before the harder threshold is reached. TCLs align the FSF more closely with the staged review structure now common in other frameworks.
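The relationship between a TCL and its CCL can be sketched as a two-threshold classifier. This is a deliberately simplified illustration: the FSF defines its levels qualitatively per domain, so the scalar `score`, the numeric thresholds, and the names `DomainThresholds` and `stage` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainThresholds:
    """Hypothetical numeric stand-ins for one domain's TCL and CCL."""
    tcl: float  # early-warning level
    ccl: float  # critical level

    def __post_init__(self) -> None:
        # A TCL is defined to sit below the corresponding CCL.
        if not self.tcl < self.ccl:
            raise ValueError("TCL must be set below the corresponding CCL")


def stage(score: float, thresholds: DomainThresholds) -> str:
    """Map a (hypothetical) evaluation score to the review stage it triggers."""
    if score >= thresholds.ccl:
        return "ccl_reached"  # escalated deployment and security mitigations apply
    if score >= thresholds.tcl:
        return "tcl_reached"  # early flag: safeguard work can begin
    return "below_tcl"


if __name__ == "__main__":
    cyber = DomainThresholds(tcl=0.4, ccl=0.8)  # illustrative values only
    print(stage(0.5, cyber))
```

Checking the CCL first means a model that has reached the harder threshold is never reported as merely having reached the early-warning one.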
The Gemini 3 Pro FSF report (November 2025) found that the model differed to a statistically significant degree from non-AI baselines on most manipulative-efficacy metrics, but that Gemini 3 Pro did not differ significantly from Gemini 2.5 Pro on the same metrics. No model has yet been declared to have reached a CCL.
| Version | Date | Headline change |
|---|---|---|
| 1.0 | May 17, 2024 | First FSF. CCLs across Cyber, CBRN, ML R&D, Autonomy. |
| 2.0 | February 4, 2025 | Refined CCL definitions; tightened evaluation procedures. |
| 3.0 | September 22, 2025 | Manipulation and shutdown resistance added as domains. |
| 3.x update | April 17, 2026 | Tracked Capability Levels introduced for early-warning detection. |
[[meta_ai|Meta]] published its Frontier AI Framework in February 2025. The document defines two risk thresholds, High Risk and Critical Risk, in two domains: cybersecurity and CBRN. The framework prescribes different responses for each tier:
Meta's framework is notable for committing to halt the development of critical-risk systems even though Meta's [[meta_ai|Llama]] family is largely released as open weights. The CBRN and cyber thresholds were drafted in light of red-team findings on prior Llama generations. Critics have pointed out that, because open weights cannot effectively be recalled after release, a critical-risk classification carries a higher operational cost for Meta than for closed-weight labs.
[[xai|xAI]] released the first draft of its Risk Management Framework in February 2025, with a finalised version on August 20, 2025 alongside the Grok 4 model card, and a Frontier AI Framework iteration dated December 31, 2025. The xAI RMF distinguishes two umbrella categories: malicious use and loss of control. Within these, evaluations target three buckets of model behaviour: abuse potential (including jailbreak vulnerability), concerning propensities (including a propensity to deceive the user), and dual-use capabilities (including offensive cyber). The framework borrows the threat-modelling structure used by other labs for CBRN, decomposing weapons development into ideation, design, build, and test phases.
The xAI framework has been criticised for the absence of any explicit pre-commitment to halt training, and for shipping Grok 4 in July 2025 without a public safety report alongside the launch. xAI subsequently published the Grok 4 model card in August 2025, and a Grok 4.1 model card on November 17, 2025. Independent reviewers including Zvi Mowshowitz and the LessWrong community have rated the framework as comparatively weak relative to peer policies, citing under-specified thresholds, sparse evaluation detail, and the gap between framework publication and model release.
Microsoft has reaffirmed the July 2023 White House Voluntary Commitments and the May 2024 Seoul Frontier AI Safety Commitments, but as of late 2025 had not published a single threshold-based framework analogous to the Anthropic, OpenAI, or DeepMind documents. METR's December 2025 review classified Microsoft as a framework signatory under a broader Responsible AI Standard rather than a dedicated frontier safety framework. Microsoft's first-party model releases (including Phi and earlier MAI models) have published documentation in line with the Responsible AI Standard.
METR's December 2025 census of frontier AI safety policies counted twelve developers with published frameworks: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA. Inflection and Mistral signed the Seoul commitments but have published lighter-weight policies. Chinese frontier labs, including DeepSeek, Zhipu, and Alibaba, had not published an RSP-equivalent document at the time of the Seoul summit; some have since released safety reports or policies, but none currently match the structure of the Anthropic, OpenAI, or DeepMind frameworks.
| Framework | Lab | First published | Capability levels | Risk categories | Deployment trigger |
|---|---|---|---|---|---|
| Responsible Scaling Policy | [[anthropic|Anthropic]] | Sep 19, 2023 (v1.0) | ASL-1 to ASL-4+ | CBRN, AI R&D, autonomy | |
| Preparedness Framework | [[openai|OpenAI]] | Dec 18, 2023 (beta); v2 Apr 15, 2025; streamlined 2026 | High / Critical (post-streamlining) | Bio/chem, cyber, AI self-improvement (Tracked); plus Research Categories | |
| Frontier Safety Framework | [[google_deepmind|Google DeepMind]] | May 17, 2024 (v1); v2 Feb 4, 2025; v3 Sep 22, 2025; TCL update Apr 17, 2026 | Critical Capability Levels per domain; Tracked Capability Levels added 2026 | Cyber, CBRN, ML R&D, autonomy, manipulation, shutdown resistance | |
| Frontier AI Framework | [[meta_ai|Meta]] | Feb 2025 | High / Critical risk | Cybersecurity, CBRN | |
| Risk Management Framework | [[xai|xAI]] | Feb 2025 (draft); Aug 20, 2025 (v1); Dec 31, 2025 update | Quantitative thresholds per metric | Malicious use; loss of control | |
| Responsible AI Standard | Microsoft | 2022 (Standard v2); ongoing updates | Not capability-graded | Cross-cutting | Pre-deployment review by Office of Responsible AI; framework-level commitments via Seoul. |
| White House Voluntary Commitments | 7 (later 15) labs | Jul 21, 2023 | Not capability-graded | Biosecurity, cybersecurity, societal harm | Internal and external red-teaming before release; weight protection; watermarking research. |
| Frontier AI Safety Commitments | 16 labs (Seoul) | May 21, 2024 | Self-defined per signatory | Self-defined per signatory | Each signatory publishes thresholds it deems intolerable and commits not to deploy models that cross them without mitigation. |
METR's recurring "Common Elements of Frontier AI Safety Policies" review (published August 2024, March 2025, and December 2025) tracks how the active frameworks compare. The December 2025 update identified a strong convergence on five elements:
Divergences are equally important. Frameworks differ in how they define "meaningful uplift", whether they apply hard pauses (Anthropic v1 and v2) or softer delays (Anthropic v3, OpenAI streamlined), the granularity of the capability tiers (two levels vs four), the role of third-party evaluations, the treatment of post-training mitigations as substitutes for capability avoidance, and the scope of the AI R&D threshold (researcher productivity vs aggregate capability progress, the distinction Anthropic's RSP v3.1 made explicit). The Frontier Model Forum's 2025 to 2026 Technical Report Series (Risk Taxonomy and Thresholds, Frontier Capability Assessments, Frontier Mitigations, Third-Party Assessments, Managing Advanced Cyber Risks) is the most active attempt to converge the technical vocabulary across labs, with reports released between April 2025 and February 2026.
| Term | Meaning |
|---|---|
| Capability evaluation | Tests measuring whether a model has reached a defined dangerous capability, combining benchmark tasks, agentic harness runs, and red-teaming. |
| Pre-deployment evaluation | A capability evaluation on a model checkpoint before external release, often shared with the [[uk_aisi |
| Required Safeguards / Deployment Standard | Mitigations that must be in place before a model at a given capability level can be deployed. |
| Security Standard | Controls protecting model weights and training infrastructure, scaling with capability level. |
| Eval saturation | When standard benchmarks no longer discriminate between models because most score near the ceiling. |
| Sandbagging | A model strategically underperforming on evaluations. Named as a risk in OpenAI Preparedness Framework v2. |
| Capability elicitation | Getting a model to demonstrate true capability, via prompting, scaffolding, and tool access. |
| Autonomous replication and adaptation (ARA) | A model's capacity to acquire resources, copy itself, and adapt to new environments without human help. |
| AI R&D capability | A model's capacity to do frontier AI research, raising risk of a feedback loop accelerating AI development. |
| Pause and re-evaluate | A commitment to halt training or deployment if a model crosses a threshold without matching safeguards ready. Replaced by a softer delay pledge in Anthropic RSP v3.0. |
| Tracked Capability Level (TCL) | An early-warning capability level set below the corresponding CCL, introduced to DeepMind's FSF in April 2026. |
| Frontier Safety Roadmap | A public document, introduced in Anthropic RSP v3.0, describing how the lab plans to develop the security, alignment, safeguards, and policy work needed for higher ASLs. |
| Risk Report | A per-model document, introduced in Anthropic RSP v3.0, explaining how a model's capabilities, threat models, and active mitigations fit together. |
| Safeguards Report | A per-model document, introduced in OpenAI's streamlined Preparedness Framework, explaining safeguards design and verification. |
| FMF Technical Report Series | The Frontier Model Forum's 2025 to 2026 series of technical reports on risk taxonomy, capability assessments, mitigations, third-party assessments, and cyber risks. |
The specific capabilities that frontier developers test for under their RSP-style policies have converged across labs:
A few public events show what these frameworks look like in practice:
RSPs are voluntary, and this is the central source of criticism. Industry-set thresholds can be drawn wherever the developer prefers, and there is no neutral arbiter to confirm that a model has not crossed them. Several concerns recur:
RSP-style policies have shaped several pieces of formal or quasi-formal governance.
The [[bletchley_declaration|Bletchley Declaration]], signed at the UK AI Safety Summit on November 1 to 2, 2023 by 28 countries and the EU, recognised that frontier AI developers carry a particularly strong responsibility to test their systems for safety risks and share results. It did not impose specific thresholds, but it endorsed the conceptual structure of capability evaluations and pre-deployment testing.
The Frontier AI Safety Commitments, signed at the [[seoul_declaration|Seoul AI Safety Summit]] on May 21, 2024 by 16 frontier AI companies, asked each signatory to publish its safety framework, define thresholds for what it considers "intolerable" risk, and commit not to deploy models that cross those thresholds without mitigation. The Seoul commitments effectively asked every frontier lab to publish something resembling an RSP. METR's December 2025 review found twelve such frameworks in print.
In the United States, the California SB 1047 bill (the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act) would have made some RSP-style commitments mandatory for "covered models" trained with more than $100 million in compute, or fine-tuned with more than $10 million. The bill required pre-training safety determinations, kill-switch mechanisms, audits, and a new Board of Frontier Models. It passed the California legislature in August 2024 and was vetoed by Governor Gavin Newsom on September 29, 2024. Newsom argued the bill focused too heavily on the largest models. Meta, OpenAI, and former House Speaker Nancy Pelosi had opposed the bill, while 113 current and former employees of OpenAI, Google DeepMind, Anthropic, Meta, and xAI had signed a letter to the Governor in support of it.
A narrower successor, California SB 53 (the Transparency in Frontier Artificial Intelligence Act, or TFAIA), authored by Senator Scott Wiener, was signed by Governor Newsom on September 29, 2025. SB 53 applies to frontier developers whose foundation models are trained above 10^26 floating-point operations and whose annual revenues exceed $500 million, capturing roughly five to eight companies including OpenAI, Anthropic, Google DeepMind, Meta, and Microsoft. The law requires covered developers to publish a frontier AI framework explaining their plans for mitigating catastrophic risks, to file standardised safety incident reports, and to maintain whistleblower protections for employees. Penalties run up to $1 million per violation, enforceable by the California Attorney General. SB 53 is the first U.S. state law to formalise an RSP-style framework as a legal obligation rather than a voluntary commitment.
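The coverage test described above is a conjunction of two thresholds, which can be sketched as follows. The constants come from the figures in the paragraph; the function name `is_covered_developer` and its parameters are hypothetical, and the sketch ignores any statutory detail beyond the compute and revenue criteria stated here.

```python
# Thresholds as stated in the text above.
COMPUTE_THRESHOLD_FLOP = 10**26          # training compute, floating-point operations
REVENUE_THRESHOLD_USD = 500_000_000      # annual revenue, US dollars


def is_covered_developer(training_flop: int, annual_revenue_usd: int) -> bool:
    """Return True only if BOTH thresholds are strictly exceeded (illustrative reading)."""
    return (
        training_flop > COMPUTE_THRESHOLD_FLOP
        and annual_revenue_usd > REVENUE_THRESHOLD_USD
    )


if __name__ == "__main__":
    # Hypothetical developer: 2 x 10^26 FLOP training run, $1 billion annual revenue.
    print(is_covered_developer(2 * 10**26, 1_000_000_000))
```

Because the two conditions are joined with `and`, a large training run by a low-revenue developer, or a high-revenue developer with a smaller model, falls outside this test as sketched.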
The [[eu_ai_act|EU AI Act]] entered into force on August 1, 2024. Its General-Purpose AI Code of Practice, published in July 2025 and entering into force on August 2, 2025, contains a Safety and Security chapter that applies only to providers of GPAI models with systemic risk, identified by a training-compute threshold of 10^25 floating-point operations. The chapter borrows directly from the RSP playbook: capability evaluations, mitigations indexed to risk levels, and incident reporting. OpenAI and Mistral are among the early signatories. The Code of Practice is partly responsible for the convergence of vocabulary across the major frameworks, since signing labs face pressure to align their published thresholds with the Code's structure.
The [[us_aisi|US AI Safety Institute]] (US AISI) and the [[uk_aisi|UK AI Security Institute]] (UK AISI) are the most active third-party evaluators of frontier models. Their joint evaluations of Anthropic's upgraded Claude 3.5 Sonnet (November 2024) and OpenAI's o1 (December 2024) are the clearest published examples of an RSP-style framework being checked by an outside party. The UK AISI also maintains Inspect, an open-source library for running model evaluations. The [[ai_safety_institute|AI Safety Institutes]] network broadened in 2024 and 2025 to include institutes in Japan, Singapore, Canada, the EU, and Korea, with a coordinated evaluation programme tied to the Seoul commitments.
The [[frontier_model_forum|Frontier Model Forum]] (FMF), founded in July 2023 by Anthropic, Google, Microsoft, and OpenAI, has become the main industry venue for converging RSP-style frameworks. Its Technical Report Series (Frontier Capability Assessments, April 2025; Risk Taxonomy and Thresholds, June 2025; Frontier Mitigations, June 2025; Third-Party Assessments, August 2025; Managing Advanced Cyber Risks, February 2026) is the most active attempt to codify shared technical practice across labs.
Several threads of work are likely to determine whether RSPs remain credible as models get more capable: