Frontier Safety Framework (Google DeepMind)
The Frontier Safety Framework (FSF) is a risk-evaluation and governance framework published by Google DeepMind, the artificial intelligence research division of Google, that aims to identify and mitigate severe risks arising from advanced frontier AI models. First released on 17 May 2024, the framework establishes capability thresholds called Critical Capability Levels (CCLs) which, if reached without adequate safeguards, would correspond to a heightened risk of severe harm. The FSF specifies an evaluation cadence, a tiered scheme of security and deployment mitigations, and governance review mechanisms intended to ensure that mitigations are in place before any model exceeding a CCL is trained, internally deployed, or externally released.[^1][^2]
The FSF is widely regarded as DeepMind's counterpart to OpenAI's Preparedness Framework and to Anthropic's Responsible Scaling Policy (RSP), both of which were published in late 2023. Like those documents, it operationalises an "if-then" approach to dangerous-capability risk: stronger mitigations are required as models approach particular capability thresholds. Unlike Anthropic's RSP, however, the FSF was initially framed as a "recommended approach" and an "exploratory" first version rather than a set of binding commitments. That distinction drew both criticism (for being non-binding) and praise (for its explicit acknowledgement of uncertainty and its early focus on machine-learning research and development risks).[^1][^3][^11]
A second iteration (v2.0) was released on 4 February 2025, expanding the CCLs for machine-learning R&D, adding a treatment of deceptive alignment, strengthening the security-level recommendations, and codifying a more rigorous deployment-mitigation process. A third iteration (v3.0) followed on 22 September 2025; it introduced a CCL for harmful manipulation, expanded misalignment protocols, and broadened safety-case review to cover large-scale internal deployments. The FSF has been applied to evaluations of Gemini 1.5, Gemini 2.0, Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 3.[^2][^4][^5][^7][^8]
Key facts
| Item | Detail |
|---|---|
| Full name | Frontier Safety Framework |
| Publisher | Google DeepMind |
| Version 1.0 published | 17 May 2024[^1] |
| Version 2.0 published | 4 February 2025[^4] |
| Version 3.0 published | 22 September 2025[^5] |
| Document type | Voluntary corporate risk-management framework |
| Core construct | Critical Capability Levels (CCLs) |
| Risk domains (v3) | CBRN, cybersecurity, ML R&D, harmful manipulation; plus exploratory misalignment / deceptive alignment[^5][^6] |
| Mitigation tiers | Security levels and deployment levels (graduated) |
| Initial v1.0 authors / leads | Anca Dragan, Helen King, Allan Dafoe[^1] |
| v3.0 announcement authors | Four Flynn, Helen King, Anca Dragan[^5] |
| Peer frameworks | OpenAI Preparedness Framework; Anthropic Responsible Scaling Policy[^9][^11] |
| Application | Gemini 1.5 / 2.0 / 2.5 / 3 evaluations[^7][^8] |
Background and motivation
DeepMind's leadership, including chief executive Demis Hassabis, publicly warned through 2023 and 2024 that as frontier models scale they may acquire dangerous capabilities, including the ability to assist with the development of weapons of mass destruction, conduct autonomous cyberattacks, or accelerate AI research in ways that outpace governance capacity. These concerns paralleled wider international discussions captured in the Bletchley Declaration of November 2023 and the Seoul Declaration of May 2024, the latter of which urged frontier developers to publish "Frontier AI Safety Commitments" describing how they would identify and mitigate severe risks.[^10][^11]
The FSF was announced days before the AI Seoul Summit of 21–22 May 2024, at which sixteen leading developers, including Google, committed to publishing safety frameworks. Internally, DeepMind had been developing dangerous-capability evaluations under its AGI Safety & Alignment team since at least 2022, building on earlier work such as the "Levels of AGI" taxonomy (Morris et al., 2023), which proposed a graded scale for general-purpose AI competence and was frequently cited in subsequent FSF materials as a conceptual underpinning for capability thresholds.[^1][^12]
DeepMind explicitly positioned the FSF as complementary to Google's pre-existing AI Principles and to "Google's existing suite of AI responsibility and safety practices" rather than as a replacement. The framework was also designed to be revisable: the May 2024 announcement noted that the document was "exploratory" and would "evolve substantially" as understanding of frontier risks improved, with full implementation targeted for early 2025.[^1]
The FSF's intellectual lineage draws on several strands of prior work. Within DeepMind, the AGI Safety & Alignment team had been publishing technical safety research on specification gaming, reward hacking, scalable oversight, and dangerous-capability evaluation methodologies for several years. Externally, the framework's structure echoes the "if-then" logic of biosafety levels in life-sciences laboratories, which classify research according to organism hazard and require correspondingly stronger physical and procedural containment. That parallel was made explicit by Anthropic's Responsible Scaling Policy and acknowledged by DeepMind in its discussion of graduated security levels.[^9][^16] The FSF also reflects the broader policy environment created by the November 2023 UK AI Safety Summit at Bletchley Park, which produced the Bletchley Declaration signed by 28 governments and the European Union, and by parallel work in the Frontier Model Forum, an industry body co-founded by Google, OpenAI, Anthropic, and Microsoft to share frontier-safety best practices.[^10][^16]
Version 1.0 (May 2024)
Version 1.0 of the FSF, published on 17 May 2024 with an accompanying blog post by Anca Dragan, Helen King, and Allan Dafoe, organised the framework around three components:[^1]
- Identifying capabilities with severe-harm potential. DeepMind defined Critical Capability Levels as "capability levels at which, absent mitigation measures, frontier AI models or systems may pose a heightened risk of severe harm." CCLs were derived by enumerating foreseeable harm pathways and identifying the minimal set of capabilities required to realise them.
- Periodic early-warning evaluations. Models would be re-evaluated each time effective compute increased by approximately a factor of six and after each roughly three-month interval of fine-tuning. "Early warning evaluations" were intended to flag when models approached a CCL with sufficient safety buffer to apply mitigations before the CCL itself was reached.[^1][^13]
- Mitigation plans. When a CCL was approached or reached, DeepMind would apply a tailored combination of security and deployment mitigations. Security mitigations were intended to reduce the risk of model-weight exfiltration; deployment mitigations were intended to constrain or monitor the expression of critical capabilities once a model was used.
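The evaluation cadence described above can be sketched as a simple trigger check. This is a minimal illustration only: the `EvalCheckpoint` type, the function name, and the choice to approximate "roughly three months of fine-tuning" as 90 calendar days since the last evaluation are assumptions made for the example, not details taken from the framework.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Approximate triggers from the v1.0 description; the exact values are assumptions.
COMPUTE_GROWTH_TRIGGER = 6.0               # "approximately a factor of six"
FINE_TUNING_INTERVAL = timedelta(days=90)  # "roughly three-month interval"


@dataclass
class EvalCheckpoint:
    """State recorded at the last early-warning evaluation (hypothetical type)."""
    effective_compute: float  # any consistent unit
    evaluated_on: date


def needs_reevaluation(last: EvalCheckpoint, current_compute: float, today: date) -> bool:
    """Return True when either cadence trigger has been met since the last evaluation."""
    if last.effective_compute <= 0:
        raise ValueError("effective_compute must be positive")
    compute_ratio = current_compute / last.effective_compute
    elapsed = today - last.evaluated_on
    return compute_ratio >= COMPUTE_GROWTH_TRIGGER or elapsed >= FINE_TUNING_INTERVAL
```

Either trigger alone is sufficient in this sketch, which matches the "each time ... and after each ..." wording; how the two conditions interact in practice is not specified in the text above.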
Version 1.0 specified four initial risk domains for CCLs:
- Autonomy — capabilities that would allow a model to operate persistently without human direction, for example by acquiring resources or replicating across infrastructure.
- Biosecurity (often referenced as the "Bio" domain) — capabilities that would meaningfully uplift non-expert actors in producing biological threats.
- Cybersecurity — capabilities that would substantially increase the ability of threat actors to execute high-impact cyberattacks.
- Machine learning R&D — capabilities that could accelerate AI development in ways with destabilising effects.[^1][^13]
The framework defined a graduated set of security levels and deployment levels keyed to CCL severity. As described in the v1.0 technical report and contemporaneous summaries, "higher level security mitigations result in greater protection against the exfiltration of model weights," while higher deployment levels enable tighter management of critical capabilities through such measures as access controls, safety fine-tuning, classifier-based filters, and monitoring.[^1][^13]
A widely noted feature of v1.0 was the language used to characterise commitments. DeepMind described the framework as a "set of protocols" and a "recommended approach" rather than a set of contractual commitments. The launch blog post acknowledged that "even though these risks are beyond the reach of present-day models, we hope that implementing and improving the framework will help us prepare to address them."[^11]
Version 2.0 (February 2025)
Version 2.0 of the FSF was published on 4 February 2025, with an accompanying blog post titled "Updating the Frontier Safety Framework."[^4] Lead authorship on the technical document was attributed to Lewis Ho, Celine Smith, Claudia van der Salm, Joslyn Barnhart, and Rohin Shah, with senior contributions from Allan Dafoe, Anca Dragan, Andy Song, and Demis Hassabis among others.[^4]
Major changes in v2.0 included:
- Security level recommendations. Each CCL was paired with a recommended security level (numbered 2 through 4), giving operators a concrete mapping between capability profile and required protection of model weights. The framework recommended particularly high security levels for ML R&D CCLs, on the reasoning that uncontrolled proliferation of those capabilities could most directly undermine collective oversight of further AI development.[^4][^9]
- More rigorous deployment-mitigation process. A multi-step procedure was codified for misuse-risk CCLs, including iterative safeguard development, the construction of an explicit safety case, governance review, and post-deployment monitoring. The blog post described "a more rigorous safety mitigation process to models reaching a CCL in a misuse risk domain."[^4]
- Explicit deceptive-alignment treatment. A new misalignment-oriented capability area was added, focused on the risk that "an autonomous system [could] deliberately undermine human control." DeepMind's initial approach centred on detecting baseline instrumental-reasoning ability and using automated monitoring to detect illicit use of such reasoning.[^4]
- Expanded ML R&D CCLs. The ML R&D domain was elaborated into multiple sub-levels addressing both uplift (a model substantially accelerating AI research) and autonomy (a model capable of running portions of the AI R&D pipeline end-to-end). The framework specifically identified the scenario in which a model "fully automates the AI R&D pipeline at a competitive cost" as a salient threshold.[^6]
- Updated misuse CCLs. The v2.0 document listed misuse-domain CCLs including "CBRN Uplift 1" (uplift to low-resourced actors for dual-use scientific protocols), "Cyber Uplift Level 1" (uplift for well-resourced threat actors against critical infrastructure), and "Cyber Autonomy Level 1" (end-to-end automation of cyberattacks against limited-security organisations).[^6]
The v2.0 document also reiterated that the framework was a living set of recommendations and that its measures should ideally be adopted across the broader frontier-developer community, noting that "the social value of any single actor's security mitigations will be significantly reduced if not broadly applied across the field."[^4]
Version 3.0 (September 2025)
The third major iteration was published on 22 September 2025 with an accompanying blog post, "Google DeepMind strengthens the Frontier Safety Framework," authored by Four Flynn, Helen King, and Anca Dragan.[^5] Key changes from v2.0 included:
- A new CCL for harmful manipulation. Version 3 added a CCL covering models "with powerful manipulative capabilities that could be misused to systematically and substantially change beliefs and behaviors in identified high stakes contexts." The blog characterised this as building on years of research into the mechanics of persuasion and described the underlying assessment methodology as "exploratory and subject to further research."[^5][^14]
- Expanded misalignment protocols. The misalignment treatment was widened to cover scenarios in which a model resists modification or shutdown, with safeguard procedures designed to maintain operator control over models that might develop instrumental motivations to evade oversight.[^5][^14]
- Broader safety-case review. Version 3 extended the requirement for governance-reviewed safety cases to large-scale internal deployments — not only external releases — recognising that internal applications of frontier models inside Google's own research and product workflows could themselves produce severe risk under some conditions.[^5]
- Sharper CCL definitions and risk-assessment process. The document refined CCL definitions to better separate routine operational concerns from threats meeting the "severe harm" criterion and added an explicit risk-acceptability determination step.[^5]
A subsequent update on 17 April 2026 added the concept of Tracked Capability Levels (TCLs) — sub-CCL thresholds intended to flag less extreme risks earlier in the development cycle than full CCLs.[^14]
The post-v3 domain set comprises four formal risk areas (CBRN, cybersecurity, ML R&D, and harmful manipulation) plus an exploratory approach for misalignment, including deceptive alignment.[^5]
Critical Capability Levels in detail
Autonomy
In v1.0 the autonomy domain captured capabilities that would let a model operate persistently and acquire resources without human direction. Illustrative thresholds discussed in commentary on the v1.0 document included models able to "self-replicate across rented hardware" or to maintain themselves financially through autonomous activity.[^13] In v2.0 the autonomy concerns were partially absorbed into the misalignment / deceptive-alignment domain and into ML R&D autonomy levels, while in v3.0 they overlap with the new shutdown-resistance protocols.[^4][^5]
Cybersecurity
The cyber domain in v2.0 distinguished "uplift" levels — additional capability conferred by an AI system to a human attacker — from "autonomy" levels, where the model itself executes an end-to-end attack. "Cyber Uplift Level 1" addressed uplift to well-resourced attackers against critical infrastructure, while "Cyber Autonomy Level 1" addressed full automation of attacks against organisations with limited security.[^6]
CBRN / Biosecurity
Initially framed as a biosecurity domain in v1.0, the misuse-from-scientific-knowledge category was broadened in v2.0 to address chemical, biological, radiological, and nuclear (CBRN) information risks. "CBRN Uplift 1" was defined as the level at which a model can meaningfully assist low-resourced actors with dual-use scientific protocols — a threshold motivated by the concern that diffusion of CBRN knowledge to non-experts is a particularly poorly-bounded class of harm.[^6]
ML R&D
Version 2.0's ML R&D domain was elaborated into uplift and autonomy CCLs. "ML R&D Uplift Level 1" referred to a model that could accelerate AI development substantially — described in commentary on the document as roughly equivalent to compressing the 2020–2024 pace of progress by a factor of two. "ML R&D Autonomy Level 1" denoted a model that could fully automate the AI R&D pipeline at competitive cost.[^6] DeepMind explicitly recommended the highest security level (Level 4) for these CCLs on the reasoning that uncontrolled diffusion of recursive-self-improvement capabilities would be among the most consequential failure modes for the field.[^4][^9]
Deceptive alignment and misalignment
Version 2.0 introduced a misalignment-oriented capability area structured around Instrumental Reasoning Levels. Instrumental Reasoning Level 1 corresponded to enough situational awareness and instrumental-reasoning capacity that a model could in principle undermine human control unless safeguards were in place; Instrumental Reasoning Level 2 corresponded to capabilities sufficient to evade detection under standard monitoring.[^6] In v3.0 these were complemented by shutdown-resistance considerations.[^5]
Harmful manipulation
Added in v3.0, the harmful manipulation CCL targets models that could, across the course of an interaction, change user beliefs and behaviours in ways that produce expected harm at severe scale in identified high-stakes contexts (for example, large-scale political persuasion or coercive influence). DeepMind described this domain as research-stage and indicated continued work on measurement methodology.[^5][^14]
Mitigation strategies
The FSF's mitigations are grouped into two families:
- Security mitigations are designed to prevent unauthorised access to model weights — in particular, to prevent state-level or well-resourced adversaries from exfiltrating a model whose deployment is subject to controls. Security levels are graduated, with Security Level 4 (the highest tier in the v2.0 / v3.0 schema) representing protections approaching state-of-the-art secure-facility controls. ML R&D autonomy CCLs are linked to Security Level 4 in v2.0.[^4][^9]
- Deployment mitigations are designed to constrain or monitor critical capabilities in models that are released either externally or internally. Mitigations include safety fine-tuning to refuse misuse requests, classifier-based filters that detect attempts to elicit dangerous behaviour, structured access patterns, capability shaping to remove dangerous knowledge without degrading general utility, and human oversight of high-risk usage.[^1][^4]
Version 2.0 added an explicit deployment-mitigation procedure consisting of (i) iterative safeguard development, (ii) a safety case justifying that mitigations adequately reduce residual risk, (iii) governance review by an "appropriate governance function," and (iv) post-deployment monitoring.[^4] Version 3.0 extended the safety-case requirement to large-scale internal deployments.[^5]
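The four-step procedure can be pictured as an ordered checklist. The sketch below is illustrative: the `Stage` names, both helper functions, and the reading that deployment is gated on the first three steps (with monitoring following deployment) are assumptions for the example rather than wording from the framework.

```python
from enum import Enum
from typing import Optional, Set


class Stage(Enum):
    """The four steps listed for misuse-risk CCLs, in order (names are hypothetical)."""
    SAFEGUARD_DEVELOPMENT = 1
    SAFETY_CASE = 2
    GOVERNANCE_REVIEW = 3
    POST_DEPLOYMENT_MONITORING = 4


def next_stage(completed: Set[Stage]) -> Optional[Stage]:
    """Return the earliest stage not yet completed, or None when all are done."""
    for stage in Stage:  # Enum iteration follows definition order
        if stage not in completed:
            return stage
    return None


def may_deploy(completed: Set[Stage]) -> bool:
    """Assumed gate: safeguards, safety case, and governance review precede deployment."""
    required = {Stage.SAFEGUARD_DEVELOPMENT, Stage.SAFETY_CASE, Stage.GOVERNANCE_REVIEW}
    return required <= completed
```

For example, `may_deploy({Stage.SAFEGUARD_DEVELOPMENT, Stage.SAFETY_CASE})` is false because governance review is still outstanding.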
Application to Gemini models
The FSF is the operating framework under which DeepMind has published model cards and dedicated "FSF report" documents accompanying its Gemini family of models.
- Gemini 1.5 — Evaluated under v1.0; the published evaluations focused on autonomy, biosecurity, and cyber uplift early-warning measures.[^15]
- Gemini 2.0 / Gemini 2.5 — Evaluated under v2.0. Documents accompanying Gemini 2.5 Pro and Gemini 2.5 Deep Think report that "across all areas covered by the Frontier Safety Framework, Critical Capability Levels (CCLs) have not been reached for Gemini 2.5 Pro," though the model crossed the early-warning thresholds for Cyber Uplift Level 1 and CBRN Uplift Level 1, triggering higher-frequency follow-up testing as the FSF's response plan envisaged.[^7][^15]
- Gemini 3 Pro — Evaluated under v3.0; the November 2025 Gemini 3 Pro FSF Report concluded that no critical alert thresholds were crossed beyond those already reached by Gemini 2.5 Pro. The cyber evaluation showed a notable performance jump (passing 11 of 12 "hard" challenges versus 6 of 12 for the prior generation), CBRN open-ended responses showed "high scientific accuracy but low novelty" with minimal wet-lab uplift, ML R&D scores remained below alert thresholds, and deceptive-alignment evaluations reported "substantial propensity for strategic deception in certain limited circumstances" combined with insufficient stealth and situational awareness for severe real-world harm.[^8]
Reception and comparison to peers
The FSF is part of a cluster of voluntary frameworks published by leading frontier developers in 2023–2025, including OpenAI's Preparedness Framework (December 2023; updated 2024–2025) and Anthropic's Responsible Scaling Policy (September 2023; updated multiple times). Comparative analyses by METR, the Frontier Model Forum, and independent researchers have identified both convergences and divergences.[^9][^16]
Common ground identified across all three frameworks includes:
- The use of capability thresholds (called Critical Capability Levels by DeepMind, "High" and "Critical" thresholds by OpenAI, and AI Safety Levels by Anthropic) to gate stronger safeguards.
- Coverage of broadly the same risk domains: biological/CBRN misuse, cyberoffense, and AI R&D acceleration.
- Pre-deployment dangerous-capability evaluations and red-team testing of mitigations.
- Periodic re-evaluation tied to capability scaling.[^9][^16]
Distinctive features of the FSF, compared with its peers, include:
- Explicit inclusion of misalignment and deceptive alignment as formal threat models from v2.0 onward — a feature that several reviewers noted goes further than the OpenAI framework's initial scope.[^9]
- Harmful manipulation added as a CCL in v3.0, before the same domain was formalised by some peer frameworks.[^5][^16]
- Graduated security levels recommended at the level of the entire field, reflecting DeepMind's view that uniform diffusion is required for security measures to be socially useful.[^4]
- A safety-case procedure that, by v3.0, applies both to external deployments and to large-scale internal use.[^5]
Criticism of the FSF has paralleled criticism of the OpenAI and Anthropic frameworks. Independent reviewers — including writers on the EA Forum, the Machine Intelligence Research Institute, and various commentators — have argued that:
- The framework's language of "recommended approach" and its qualified commitments, particularly in v1.0, fall short of binding constraints, and that the framework is therefore more transparency mechanism than enforcement mechanism.[^11][^17]
- Mitigation thresholds remain under-specified relative to the seriousness of the risks they aim to address, and that current jailbreaking and elicitation techniques routinely defeat deployment-stage safeguards within days of model release.[^17]
- The frameworks broadly underweight the possibility that important threat models have been missed: in contrast to mature safety-critical industries, frontier AI risk management generally lacks a structural assumption that something important has been overlooked.[^17]
Supportive responses have emphasised the FSF's relative early coverage of ML R&D risks and deceptive alignment, its explicit articulation of governance-reviewed safety cases, and the published Gemini FSF reports as concrete artefacts demonstrating that the framework is being operated, not merely declared.[^9][^16]
External regulatory developments have also shaped the FSF's significance. The EU AI Act (entered into force 2024) imposes systemic-risk obligations on general-purpose AI models above defined compute thresholds, and California's SB 53 (2025) creates statutory obligations for frontier developers to publish, implement, and comply with frontier-AI safety frameworks. Commentators have noted that voluntary frameworks such as the FSF are increasingly intersecting with, and may be partially absorbed by, this regulatory layer.[^9]
Governance and process
The FSF embeds its mitigation choices in a defined governance process. From v2.0 onward, the framework specifies that decisions about whether mitigations are sufficient — and therefore whether a model should be trained further, deployed externally, or deployed internally at scale — must pass through an "appropriate governance function" within Google DeepMind, which the document does not name individually but describes in terms of role and authority. The governance function reviews:
- The dangerous-capability evaluation results.
- The associated safety case justifying that mitigations reduce residual risk to acceptable levels.
- Plans for ongoing post-deployment monitoring.[^4][^5]
When evaluations indicate that a model is approaching or has crossed an early-warning threshold for a CCL, the framework requires a structured response plan. The Gemini 2.5 Pro materials describe such a response — including higher-frequency follow-up testing and additional mitigations — in concrete terms following observed approach to the Cyber Uplift Level 1 and CBRN Uplift Level 1 thresholds.[^7] The FSF also envisages collaboration with external evaluators, who in the case of Gemini 3 Pro provided independent CBRN wet-lab testing and additional adversarial red-team probing of deceptive-alignment behaviour.[^8]
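The relationship between early-warning thresholds, CCLs, and the response plan can be sketched as a three-way classification. Everything here is hypothetical: the framework does not publish a single scalar score, so the `score`, `early_warning`, and `ccl` parameters and the response strings are assumptions that only illustrate the ordering of the thresholds.

```python
from enum import Enum


class Status(Enum):
    BELOW_EARLY_WARNING = "below early-warning threshold"
    EARLY_WARNING = "early-warning threshold crossed"
    CCL_REACHED = "CCL reached"


def classify(score: float, early_warning: float, ccl: float) -> Status:
    """Place a (hypothetical) evaluation score relative to the two thresholds."""
    if not early_warning < ccl:
        raise ValueError("early-warning threshold must sit below the CCL threshold")
    if score >= ccl:
        return Status.CCL_REACHED
    if score >= early_warning:
        return Status.EARLY_WARNING
    return Status.BELOW_EARLY_WARNING


# Illustrative responses, paraphrasing the behaviour described in the text above.
RESPONSES = {
    Status.BELOW_EARLY_WARNING: "continue the routine evaluation cadence",
    Status.EARLY_WARNING: "response plan: higher-frequency testing and added mitigations",
    Status.CCL_REACHED: "apply mitigations and seek governance review before proceeding",
}
```

The gap between the two thresholds plays the role of the safety buffer: it is the window in which mitigations can be prepared before the CCL itself is reached.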
DeepMind has framed the publication of FSF reports alongside model card documents as the principal transparency mechanism. Each FSF report summarises which CCLs were evaluated, which evaluation suites were used, how results compared against early-warning and alert thresholds, and what mitigations were applied. Together with the system-card series, this approach is intended to give external researchers, governments, and other developers a documented basis for assessing the safety claims associated with each model.[^7][^8]
Relationship to Anthropic's RSP and OpenAI's Preparedness Framework
Comparison with peer frameworks helps clarify both the FSF's contribution and its limitations.
Anthropic's Responsible Scaling Policy (first published September 2023) uses a tiered scheme of AI Safety Levels (ASL-1 through ASL-4 and beyond), patterned on biosafety levels. ASL-3 introduces requirements such as a "commitment not to deploy if catastrophic misuse risk is evident under adversarial testing," a defence-in-depth deployment standard, and a security standard validated by independent third-party audit. The RSP has been updated repeatedly and is administered through a Responsible Scaling Officer reporting to Anthropic's CEO and board.[^9]
OpenAI's Preparedness Framework (first published December 2023) defines Tracked Risk Categories (initially biological, cybersecurity, autonomous replication, and AI self-improvement) and rates models against "Low," "Medium," "High," and "Critical" capability thresholds in each category. The framework commits OpenAI to not deploy models that reach Critical capability levels without strong mitigations, governance review by a Safety Advisory Group, and explicit board-level escalation paths for safety decisions.[^9]
The FSF's central construct, the Critical Capability Level, sits somewhere between Anthropic's biosafety-style levels and OpenAI's tracked-category thresholds. Like OpenAI's framework, it organises CCLs by domain (CBRN, cyber, ML R&D, manipulation, misalignment) rather than by a single global level. Like Anthropic's RSP, it pairs each capability threshold with prescribed security and deployment mitigations. Distinctively, it treats ML R&D as a top-tier risk requiring Security Level 4 mitigations from v2.0, includes deceptive alignment as a formal capability area, and from v3.0 includes harmful manipulation as a CCL.[^4][^5][^9]
Despite these structural differences, independent reviewers including METR and the Frontier Model Forum have judged the three frameworks to be substantially convergent in their underlying logic: each pre-commits to specific kinds of evaluation, defines (with varying precision) the capability thresholds that would trigger stronger safeguards, and creates a documented process for deciding whether a model should be deployed. The principal practical question is one of stringency — how much capability is enough to trigger which mitigation — and of binding force, since all three frameworks remain voluntary corporate commitments rather than legally enforceable instruments in most jurisdictions.[^9][^16]
References