# Frontier models

> Source: https://aiwiki.ai/wiki/frontier_models
> Updated: 2026-06-21
> Categories: AI Policy & Regulation, AI Safety, Artificial Intelligence, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Frontier models** are the most advanced [artificial intelligence](/wiki/artificial_intelligence) models, representing the cutting edge of AI capabilities at any given time. These highly capable [foundation models](/wiki/foundation_models), generally implemented as very [large language models](/wiki/large_language_model), are typically characterized by massive scale, multimodal capabilities, and the ability to perform a wide variety of complex tasks across different domains.[^1][^2] The term encompasses both foundation models and general-purpose AI systems that exceed the capabilities of existing models. Such systems often require significant computational resources to develop and deploy, typically more than 10^25 floating-point operations (FLOPs) during training.[^3][^4]

In regulatory practice the label is anchored to a quantitative compute threshold: a training run of 10^25 FLOPs, the marker adopted by the European Union's AI Act for "general-purpose AI models with systemic risk." [Epoch AI](/wiki/epoch_ai) counted 24 models above this threshold at the end of 2024, with roughly two new models crossing it each month during that year; the first was [GPT-4](/wiki/gpt-4) in March 2023.[^3][^4] As of mid-2026, models widely regarded as sitting at the frontier include [Anthropic](/wiki/anthropic)'s Claude Opus 4.8, [OpenAI](/wiki/openai)'s [GPT-5.5](/wiki/gpt-5-5), [Google DeepMind](/wiki/google_deepmind)'s [Gemini 3.1 Pro](/wiki/gemini_3_pro), [xAI](/wiki/xai)'s [Grok 4.3](/wiki/grok_4), and challengers from Chinese labs such as the open-weight [DeepSeek](/wiki/deepseek) V4 and Alibaba's proprietary Qwen3.7-Max.

## What is a frontier model?

### Core definition

The concept of frontier models was formally introduced in a 2023 paper led by Markus Anderljung and co-authored by researchers at OpenAI, Google DeepMind, and the Centre for the Governance of AI, titled "Frontier AI Regulation: Managing Emerging Risks to Public Safety," where they are defined as "highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety."[^5] The authors gave concrete examples of the kinds of capabilities that would meet this bar, writing that they "may include significantly enabling the acquisition of weapons of mass destruction, exploiting vulnerabilities in safety-critical software systems, synthesising persuasive disinformation at scale, or evading human control."[^5] This definition highlights three key challenges: dangerous capabilities can emerge unexpectedly during development or post-deployment; it is difficult to prevent misuse of deployed models; and model capabilities can proliferate rapidly, especially through open-source releases.

The United Kingdom government, a leading proponent of the term, defines frontier AI as "highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today's most advanced models."[^6] The UK's analysis emphasizes that frontier models are increasingly multimodal and may be augmented into autonomous [AI agents](/wiki/ai_agents) with tool use (for example web browsing and code execution) that expand their real-world impact.[^7]

The [Frontier Model Forum](/wiki/frontier_model_forum), an industry body established by leading AI companies, defines frontier models as "large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks."[^8]

### Origin of the term

The phrase "frontier AI" gained currency through the UK government's policy work leading up to the November 2023 AI Safety Summit at Bletchley Park. Prior to that summit, UK officials used the term in consultations and white papers to distinguish the most capable general-purpose models from narrower AI systems. The term was simultaneously taken up by OpenAI and [Anthropic](/wiki/anthropic) in published [AI safety](/wiki/ai_safety) research and company policy documents.

The word "frontier" is borrowed from the economics literature on production possibilities and technological frontiers, where it describes the outer boundary of what is technically achievable. Applied to AI, the frontier is not a fixed line but a moving one: a model that sits at the frontier today becomes the baseline against which future systems are measured. This dynamism is central to how governments and researchers use the term, because any fixed capability threshold will eventually be crossed by a wider range of systems as the field advances.

The Frontier Model Forum (FMF), established in July 2023 by Anthropic, Google, Microsoft, and OpenAI, gave the terminology institutional weight and helped anchor it across industry, policy, and academic discourse.

### Key characteristics

Frontier models typically exhibit several defining characteristics:

| Characteristic | Description | Implication |
| --- | --- | --- |
| Massive scale and compute | Trained using vast computational resources, often exceeding 10^25 FLOPs, with parameter counts in the hundreds of billions or trillions, and training costs ranging from tens to hundreds of millions of dollars | High development costs create barriers to entry, concentrating power in well-funded labs. Scale is a primary driver of advanced capabilities |
| Multimodal capabilities | Ability to process and generate multiple types of data including text, images, audio, and video | Enables broad applicability across domains but complicates safety evaluation and control |
| Emergent properties | Display abilities that were not explicitly programmed, such as complex reasoning, chain-of-thought reasoning, code generation, or creative writing | Makes pre-deployment risk assessment challenging. The full capabilities may not be understood until widely deployed[^9] |
| General-purpose functionality | Designed to be adaptable across a wide range of domains and tasks with minimal fine-tuning | Enormous utility but difficult to predict all possible uses, including malicious applications |
| Extended context windows | Modern frontier models can process extensive amounts of information, with some models supporting context windows of over 1 million tokens | Enables complex, long-form tasks but increases potential for sophisticated manipulation[^10] |
| Autonomy potential | Advanced models show increasing ability to act autonomously to achieve goals, using tools, accessing external information, and self-correcting | Raises long-term safety concerns about control and alignment with human values |

## History and development

### Evolution of the term

The history of frontier models is intertwined with the evolution of [foundation models](/wiki/foundation_models), which emerged in the late 2010s. The term "foundation model" was coined in 2021 by researchers at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) to describe large-scale models trained on broad data using self-supervision.[^11]

The first models trained on over 10^23 FLOPs were [AlphaGo](/wiki/alphago) Master and AlphaGo Zero, developed by DeepMind and published in 2017.[^12] The progression to current frontier models proceeded rapidly through successive generations of transformer-based language models.

### Timeline of major developments

| Date | Milestone | Significance |
| --- | --- | --- |
| 2017 | AlphaGo Master/Zero | First models exceeding 10^23 FLOPs |
| 2018 | [BERT](/wiki/bert) (Google) | Demonstrated power of transformer-based language models |
| 2019 | [GPT-2](/wiki/gpt-2) (OpenAI) | Raised concerns about misuse, leading to staged release |
| 2020 | [GPT-3](/wiki/gpt-3) (OpenAI) | 175 billion parameters, sparked widespread interest in large language models |
| March 2023 | [GPT-4](/wiki/gpt-4) (OpenAI) | First model widely acknowledged as exceeding 10^25 FLOPs, demonstrated significant capability leap[^13] |
| May 2023 | Joint AI Safety Statement | CEOs of major AI labs declared extinction risk from AI a global priority[^14] |
| July 2023 | Frontier Model Forum launched | Industry coordination body established by Anthropic, Google, Microsoft, and OpenAI[^15] |
| July 2023 | White House voluntary commitments | Seven leading AI companies committed to safety, red-teaming, and information-sharing with the US government |
| October 2023 | U.S. Executive Order 14110 | Established compute-based reporting requirements for frontier models[^16] |
| November 2023 | Bletchley Declaration | 28 countries agreed on frontier AI risks and cooperation at the UK AI Safety Summit[^17] |
| 2024 | EU AI Act finalized | Establishes regulations for "general-purpose AI models with systemic risk"[^18] |
| May 2024 | AI Seoul Summit | Launched the AI Safety Institute International Network; Amazon and Meta joined the FMF |
| August 2025 | [GPT-5](/wiki/gpt-5) (OpenAI) | Major capability milestone; state-of-the-art across coding, math, and scientific reasoning |
| February 2026 | Anthropic RSP v3.0 | Substantially updated Responsible Scaling Policy with new transparency and accountability measures |
| April 2026 | Claude Mythos preview (Anthropic) | Announced April 7, 2026; restricted-release model a capability tier above Opus 4.7, withheld from general availability over offensive-cybersecurity risk[^40] |
| April 2026 | Muse Spark (Meta) | Released April 8, 2026; first model from Meta Superintelligence Labs, a closed-source departure from the Llama family[^41] |
| April 2026 | [Claude Opus 4.7](/wiki/claude_opus_4_7) (Anthropic) | Released April 16, 2026; high-resolution image support, 1M context, and task budgets[^35] |
| April 2026 | [GPT-5.5](/wiki/gpt-5-5) (OpenAI) | Released April 23, 2026; improved agentic capabilities and scientific reasoning[^34] |
| April-May 2026 | [DeepSeek](/wiki/deepseek) V4 / Qwen3.7-Max | Open-weight DeepSeek V4 (April 24) and Alibaba's proprietary Qwen3.7-Max (May 19-20) place Chinese labs in the global top tier[^37][^43] |
| May 2026 | Claude Opus 4.8 (Anthropic) | Released May 28, 2026; current Anthropic flagship, adding effort controls and a faster, cheaper mode[^36] |
| August 2026 | EU AI Act enforcement | European Commission gains powers to fine GPAI providers (up to 3% of global turnover or EUR 15 million) from August 2, 2026[^42] |

## Frontier Model Forum

The Frontier Model Forum (FMF) was established in July 2023 by Anthropic, Google, Microsoft, and OpenAI as the primary industry coordination body for frontier AI safety. Amazon and Meta joined in May 2024, expanding the membership to six of the world's largest AI developers.

| Aspect | Details |
| --- | --- |
| Current members (2026) | Amazon, Anthropic, Google, Meta, Microsoft, OpenAI[^27] |
| Key objectives | AI safety research, best practices, policy collaboration, societal applications |
| AI Safety Fund | $10 million for independent safety research, managed directly by FMF from June 2025 |
| Executive Director | Chris Meserole (appointed October 2023) |
| Focus areas | CBRN risks, cyber capabilities, societal impacts, evaluation standards |

In its founding announcement the four launch members described the Forum as "an industry body focused on ensuring safe and responsible development of frontier AI models," defining frontier AI in its terminology as "those general purpose AI models that constitute the state of the art, a collection which will shift over time as the field progresses."[^8][^15]

In March 2025 the FMF announced a first-of-its-kind information-sharing agreement enabling member firms to exchange information about vulnerabilities, threats, and capabilities of concern unique to frontier AI. The Forum intends to pilot voluntary information-sharing with non-FMF frontier AI companies as well. In January 2026 the FMF published work on "Chain of Thought Monitorability," examining techniques for verifying the faithfulness of reasoning traces in large language models.

The FMF's AI Safety Fund has supported multiple cohorts of independent safety researchers. After the Meridian Institute announced in June 2025 that it would be winding down its operations, the FMF began managing the fund directly, issuing grants in areas including biosecurity, cybersecurity, AI agent evaluation, and synthetic content detection.

## What compute threshold defines a frontier model?

### The 10^25 FLOPs threshold

The most widely used quantitative marker for a frontier model is a training compute of 10^25 floating-point operations (FLOPs). The EU AI Act adopts this figure, and the US Executive Order on AI took the same compute-based approach with a higher reporting threshold of 10^26 FLOPs. The 10^25 mark is favored because it correlates approximately with the point at which models have shown the ability to acquire novel dangerous capabilities, though the correspondence is imperfect and contested.

The threshold captures several practical realities. Models trained at or above 10^25 FLOPs require specialized data center infrastructure, large GPU or TPU clusters, and training budgets typically exceeding $50 million. These resource requirements mean that relatively few organizations can train such systems, making compute a tractable lever for oversight. By contrast, models below this threshold can be reproduced by smaller research organizations or derived from open weights, making supply-side compute controls less effective for them.

Research published in 2025 by the Centre for the Governance of AI estimated that approximately 23 models exceeded 10^25 FLOPs at the end of 2024, with projections suggesting that between 103 and 306 models could cross this threshold by 2028. This rapid proliferation means that compute-threshold-based regulation will need to be recalibrated over time to remain meaningful.[^4]

The EU AI Act distinguishes two tiers of compute:

- **GPAI models (general-purpose AI):** Trained with over 10^23 FLOPs. Subject to transparency and copyright compliance obligations.
- **GPAI models with systemic risk:** Trained with over 10^25 FLOPs. Subject to additional obligations including model evaluation, incident reporting, cybersecurity requirements, and mandatory notification to the EU AI Office.
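
The two tiers above can be sketched as a simple classifier. This is a minimal illustration, not the calculation method prescribed by the Commission's guidelines: the `6 * parameters * tokens` estimate is a commonly used rule of thumb for dense transformers and is an assumption here, and the function names and the example model are hypothetical.

```python
# Indicative EU AI Act compute tiers, as described in this article.
GPAI_THRESHOLD_FLOPS = 1e23
SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate (assumption): ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def eu_compute_tier(training_flops: float) -> str:
    """Map a training-compute figure to the tier names used in the list above."""
    if training_flops > SYSTEMIC_RISK_THRESHOLD_FLOPS:
        return "GPAI with systemic risk"
    if training_flops > GPAI_THRESHOLD_FLOPS:
        return "GPAI"
    return "below GPAI threshold"


# Hypothetical model: 70 billion parameters trained on 15 trillion tokens.
flops = estimate_training_flops(70e9, 15e12)
print(f"{flops:.2e} FLOPs -> {eu_compute_tier(flops)}")
```

Under these assumptions the hypothetical model lands at roughly 6.3 × 10^24 FLOPs, inside the GPAI tier but just below the systemic-risk presumption. For mixture-of-experts models the estimate would normally use active rather than total parameters, which is one reason raw parameter counts alone say little about which tier applies.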

### Limitations of compute as a proxy

Researchers and policymakers have noted several limitations of relying exclusively on training compute as a threshold. Algorithmic improvements, more efficient architectures (such as mixture-of-experts), and longer post-training processes (including reinforcement learning from human feedback and reinforcement learning on verifiable outcomes) can produce highly capable models at lower pre-training compute costs. Reasoning training compute in particular has grown roughly tenfold every three to five months, far outpacing the four-to-five times annual growth in pre-training compute, and a raw sum of pre-training and post-training compute may become a progressively worse proxy for model capabilities as training recipes evolve.[^32] This complicates the historical [scaling laws](/wiki/scaling_laws) relationship in which capability gains tracked smoothly with increases in pre-training compute, parameters, and data.
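
To put the two growth rates quoted above on the same footing, the short sketch below converts "tenfold every three to five months" into a per-year multiple. It is plain arithmetic on the figures in the paragraph; the helper name `annualized_growth` is hypothetical, and the result is an extrapolation for comparison only, not a forecast.

```python
def annualized_growth(factor: float, period_months: float) -> float:
    """Convert 'grows by `factor` every `period_months`' into a per-year multiple."""
    return factor ** (12.0 / period_months)


fast = annualized_growth(10, 3)  # tenfold every three months
slow = annualized_growth(10, 5)  # tenfold every five months

print(f"Reasoning-training compute: about {slow:,.0f}x to {fast:,.0f}x per year")
print("Pre-training compute: about 4x to 5x per year (figure quoted above)")
```

Sustained for a full year, tenfold growth every three to five months would compound to somewhere between roughly 250 and 10,000 times, which is why a threshold calibrated on pre-training compute alone can drift out of step with how capabilities are actually produced.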

Some researchers advocate for capability-based thresholds instead of or alongside compute thresholds, measuring whether a model can perform specific dangerous tasks rather than how much compute was used to train it. This approach aligns with the UK [AI Safety Institute](/wiki/ai_safety_institute)'s evaluation methodology and with the METR time-horizon framework.

## How are frontier models regulated?

### Bletchley Declaration (2023)

The Bletchley Declaration, signed at the AI Safety Summit at Bletchley Park on November 1-2, 2023, was the first international agreement to formally define and address frontier AI risks. It was signed by 28 countries including the United States, China, the European Union, and nations from Africa, the Middle East, and Asia.

The Declaration defined frontier AI as "highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today's most advanced models." It noted that frontier AI presents substantial risks from "potential intentional misuse or unintended issues of control," with particular concern in the domains of cybersecurity and biotechnology.

Key commitments in the Declaration included:

- A mutual understanding of frontier AI risks and the need to act collectively
- Support for an independent and inclusive "State of the Science" report on frontier AI, led by Yoshua Bengio
- Commitment to working together in an inclusive manner toward human-centric, trustworthy, and responsible AI development
- Agreement for leading AI companies to allow governments to test frontier models before public release, with governments sharing results and collaboratively developing safety standards

### White House voluntary commitments (2023)

On July 21, 2023, the Biden administration secured voluntary commitments from seven leading AI companies: Amazon, Anthropic, Google, Inflection AI, Meta, Microsoft, and OpenAI. A second set of commitments from eight additional companies including IBM, Nvidia, and Palantir followed in September 2023.

The commitments addressed safety practices considered appropriate for frontier models specifically. Companies committed to:

- Internal and external red-teaming of models in areas including misuse, societal risks, and national security concerns prior to deployment
- Sharing safety information with governments, civil society, and academia, including standards from the NIST AI Risk Management Framework
- Publishing transparency reports on model capabilities and limitations
- Treating model weights as core intellectual property and investing in security protections
- Developing technical mechanisms to watermark AI-generated audio and visual content

### European Union AI Act

The EU AI Act, whose obligations for general-purpose AI models took effect in August 2025, establishes the most detailed regulatory regime currently applicable to frontier models. It uses the term "general-purpose AI models with systemic risk" (GPAISR) rather than "frontier models" but covers the same set of systems.

| Aspect | Requirement | Details |
| --- | --- | --- |
| Compute threshold | 10^25 FLOPs | Models exceeding this threshold are presumed to have systemic risk[^19] |
| GPAI baseline threshold | 10^23 FLOPs | All GPAI models above this threshold face transparency and copyright obligations |
| Model evaluations | Mandatory testing | Standardized protocols and adversarial testing required |
| Risk assessment | Systemic risk evaluation | Must assess and mitigate potential societal-scale risks |
| Incident reporting | Report to AI Office | Serious incidents must be reported to the EU AI Office |
| Cybersecurity | Adequate protections | Ensure model and weight protection against misuse |
| Documentation | Technical documentation | Comprehensive documentation for downstream providers[^20] |
| Compliance deadline | August 2, 2025 | Full implementation required for covered models; models on market before this date have until August 2027 |
| Notification | Within two weeks | Providers must notify the European Commission's AI Office within two weeks of crossing the 10^25 FLOPs threshold |

On July 10, 2025 the European AI Office published the final General-Purpose AI Code of Practice, a voluntary instrument drafted by independent experts that gives providers a presumption-of-conformity route to meeting the transparency, copyright, and safety-and-security obligations; alongside it the Commission issued guidelines clarifying how the training compute threshold should be calculated and what documentation requirements apply to open-weight and open-source models. The obligations for GPAI providers became applicable on August 2, 2025, and from August 2, 2026 the Commission gains powers to enforce them, including fines of up to 3 percent of global annual turnover or EUR 15 million, whichever is higher, under Article 101.[^42] Separately, a "Digital Omnibus" simplification package amending parts of the AI Act was adopted by the Commission on November 19, 2025, with a political agreement reached on May 7, 2026 that, among other changes, extends the transition period for certain high-risk AI systems to August 2, 2028.
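
The "whichever is higher" rule for the fine ceiling can be illustrated with a few lines of arithmetic. This sketch only encodes the two figures quoted above (3 percent and EUR 15 million); the function name and the example turnovers are hypothetical, and the actual amount of any fine is set case by case.

```python
FINE_RATE = 0.03                # 3 percent of global annual turnover
FINE_FLOOR_EUR = 15_000_000.0   # EUR 15 million


def max_gpai_fine_eur(global_annual_turnover_eur: float) -> float:
    """Upper bound on a fine: the higher of 3% of turnover or EUR 15 million."""
    return max(FINE_RATE * global_annual_turnover_eur, FINE_FLOOR_EUR)


# Hypothetical providers with EUR 100 million and EUR 2 billion turnover.
for turnover in (100e6, 2e9):
    print(f"turnover EUR {turnover:,.0f} -> ceiling EUR {max_gpai_fine_eur(turnover):,.0f}")
```

The fixed EUR 15 million figure is the binding ceiling for any provider with turnover below EUR 500 million; above that point the 3 percent term takes over.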

### United States framework

The U.S. approach has centered on Executive Order 14110 on Safe, Secure, and Trustworthy AI, issued in October 2023. The Executive Order used a threshold of 10^26 FLOPs (one order of magnitude above the EU threshold) for the most intensive reporting requirements, while also establishing requirements for biological sequence models above 10^23 FLOPs due to biosecurity concerns.

| Component | Threshold | Requirements |
| --- | --- | --- |
| Dual-use foundation models | More than 10^26 FLOPs | Report to government; share safety test results |
| Biological sequence models | More than 10^23 FLOPs (if primarily biological data) | Enhanced scrutiny for biosecurity risks |
| Red team testing | All covered models | Required before deployment |
| NIST AI RMF | Voluntary framework | Risk management guidance for AI lifecycle[^21] |
| [AI Safety Institute](/wiki/ai_safety_institute) | Established 2023 | Develops standards and evaluation frameworks[^22] |
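
The two compute thresholds in the table can be expressed as a single check. This is a simplified sketch of the thresholds as summarized in this article, not a reading of the order's full legal text; the `TrainingRun` type and its field names are hypothetical.

```python
from dataclasses import dataclass

GENERAL_THRESHOLD_FLOPS = 1e26      # dual-use foundation models
BIOLOGICAL_THRESHOLD_FLOPS = 1e23   # models trained primarily on biological sequence data


@dataclass
class TrainingRun:
    training_flops: float
    primarily_biological_sequence_data: bool = False


def reporting_threshold_crossed(run: TrainingRun) -> bool:
    """True if the run exceeds the compute threshold that applies to it."""
    if run.primarily_biological_sequence_data:
        return run.training_flops > BIOLOGICAL_THRESHOLD_FLOPS
    return run.training_flops > GENERAL_THRESHOLD_FLOPS


# Hypothetical runs: the same compute budget is treated very differently.
print(reporting_threshold_crossed(TrainingRun(5e24)))
print(reporting_threshold_crossed(TrainingRun(5e24, primarily_biological_sequence_data=True)))
```

The lower bar for biological sequence models means that a run three orders of magnitude smaller than a general-purpose frontier run could still have fallen within the reporting regime.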

The US AI Safety Institute, housed within NIST, was established in late 2023 to develop evaluation frameworks and conduct pre-deployment testing of frontier models. It played a role in developing the evaluation protocols used at the Bletchley AI Safety Summit and subsequently.

### United Kingdom approach

The UK created the AI Safety Institute (AISI) in November 2023, which was renamed the AI Security Institute in 2025. The institute focuses on capability-based evaluation rather than rigid compute thresholds, evaluating models across four broad categories:

1. Safeguards effectiveness: Whether safety measures actually prevent harmful outputs
2. Autonomy capabilities: The extent to which models can complete extended tasks without human intervention
3. Human influence potential: Capabilities to generate persuasive content, synthetic media, or targeted messaging
4. Societal resilience: Effects on critical infrastructure, information environments, and democratic institutions

By its 2025 year-in-review, the UK AISI's technical team had tested more than 30 of the world's most advanced models. The institute's publicly released Frontier AI Trends Report (2025) tracked capability changes across successive model generations, finding that AI models could complete apprentice-level cybersecurity tasks about 50 percent of the time by mid-2025, compared to just over 10 percent in early 2024. A model tested in 2025 was the first to successfully complete expert-level cyber tasks typically requiring over ten years of human professional experience.

The UK also led the creation of the AI Safety Institute International Network at the AI Seoul Summit in May 2024, connecting safety institutes across the US, UK, EU, Japan, Singapore, Canada, France, Kenya, Australia, South Korea, and other countries.

### California state legislation

California has been active in US state-level frontier AI governance:

- **SB 1047** (failed, 2024): Would have required safety testing for models trained with more than 10^26 FLOPs and at a cost above $100 million; vetoed by Governor Gavin Newsom in September 2024.[^24]
- **SB 53** (passed, 2025): The Transparency in Frontier AI Act established reporting mechanisms and whistleblower protections for employees at frontier AI companies.[^25]

## What are the current frontier models? (May 2026)

As of May 2026, the following models are considered to sit at or near the frontier of AI capabilities across major capability dimensions. The list mixes proprietary flagships from the largest US labs and Alibaba with open-weight challengers from DeepSeek and Mistral, reflecting a roster that now spans several countries and both closed and open release strategies.

| Model | Developer | Initial release | Key capabilities | Context window | Release model |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | [Anthropic](/wiki/anthropic) | May 2026 | Hybrid reasoning, agentic coding, vision, effort controls, faster/cheaper mode | 1 million tokens | Proprietary[^36] |
| [GPT-5.5](/wiki/gpt-5-5) | [OpenAI](/wiki/openai) | April 2026 | Agentic workflows, scientific research, code, multimodal (text/images/audio/video) | 1 million tokens | Proprietary[^34] |
| [Gemini 3.1 Pro](/wiki/gemini_3_pro) | [Google DeepMind](/wiki/google_deepmind) | February 2026 | Natively multimodal, 77.1% on ARC-AGI-2, strong coding and finance agents | 1 million tokens | Proprietary[^38] |
| [Grok 4.3](/wiki/grok_4) | [xAI](/wiki/xai) | May 2026 | Real-time knowledge, reasoning, native video input, slide generation | 1 million tokens | Proprietary[^39] |
| Muse Spark | [Meta](/wiki/meta) | April 2026 | First Meta Superintelligence Labs model; parallel reasoning; compute-efficient | 262,000 tokens | Proprietary[^41] |
| Qwen3.7-Max | [Alibaba](/wiki/alibaba) | May 2026 | Long-horizon agent (1,000+ tool calls), reasoning; top-ranked Chinese model | 1 million tokens | Proprietary[^43] |
| [DeepSeek V4](/wiki/deepseek_v4) | [DeepSeek](/wiki/deepseek) | April 2026 | Open-weight MoE, 1.6T parameters (49B activated in V4-Pro), top coding and STEM among open models | 1 million tokens | Open weights (MIT)[^37] |
| Mistral Large 3 | [Mistral](/wiki/mistral) | December 2025 | Largest open-weight MoE (675B total / 41B active); multilingual; vision | 256,000 tokens | Open weights (Apache 2.0)[^44] |

### Claude Opus 4.8

Claude Opus 4.8, released on May 28, 2026, is Anthropic's most capable generally available model as of mid-2026. It is a hybrid reasoning model with a 1 million token context window and a 128,000 token maximum output. The release added user-facing effort controls that let callers decide how much computation Claude devotes to a response, a "dynamic workflow" feature that runs multiple subagents in parallel, and a faster mode that Anthropic states runs at roughly 2.5 times the speed and about one third of the cost of prior fast modes. Anthropic positioned it as improving on Opus 4.6 while fixing the comment-verbosity and tool-calling issues reported with Opus 4.7. Pricing is $5 per million input tokens and $25 per million output tokens, available on the Claude Platform, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.[^36]

### Claude Opus 4.7

[Claude Opus 4.7](/wiki/claude_opus_4_7) was released on April 16, 2026. It introduced several new features: task budgets, which give the model a rough estimate of how many tokens to target for a full agentic loop including thinking, tool calls, and final output; high-resolution image support up to 2,576 pixels / 3.75 megapixels (increased from 1,568 pixels in prior versions); a new tokenizer; and a 128,000 token maximum output length. The model supports adaptive thinking and is available through the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6.

### GPT-5.5

[GPT-5.5](/wiki/gpt-5-5) was released by OpenAI on April 23, 2026. The model processes text, images, audio, and video in a single unified architecture. OpenAI described it as matching GPT-5.4 per-token latency in real-world serving while performing at a substantially higher level of intelligence. OpenAI reported that GPT-5.5 scored 51.7 percent on [FrontierMath](/wiki/frontiermath) tiers 1-3 and 35.4 percent on the tier 4 research-level problems, with the GPT-5.5 Pro configuration reaching 52.4 percent on tiers 1-3 to lead the leaderboard as of late April 2026. The model also scored 82.7 percent on Terminal-Bench 2.0, an agentic command-line benchmark. OpenAI released a system card documenting safety evaluations across cybersecurity, CBRN, and persuasion risk categories.[^34] A streamlined variant, GPT-5.5 Instant, was released on May 5, 2026.

### Gemini 3.1 Pro

Gemini 3.1 Pro was released by Google DeepMind in preview on February 19, 2026. It iterates on Gemini 3 Pro, which Google released in late 2025. The model achieved a verified score of 77.1 percent on ARC-AGI-2, more than double the performance of Gemini 3 Pro. Gemini 3.1 Pro includes improved agentic behavior in domains such as finance and spreadsheet applications, and supports a 1 million token context window at $2 per million input tokens and $12 per million output tokens.[^38]

### Grok 4.x

[Grok 4](/wiki/grok_4) was the initial release from xAI in July 2025, trained on a 200,000 GPU cluster with reinforcement learning at pretraining scale. Grok 4.20 launched in beta on February 17, 2026, introducing a 2 million token context window and a 16-agent "Heavy" system. Grok 4.3 entered beta on grok.com and the SuperGrok apps on April 17, 2026, reached the public API on April 30, and rolled out broadly during the week of May 4, 2026, adding native video input, presentation slide generation, and enhanced long-context processing with a 1 million token context window. xAI describes it as its most intelligent and fastest model; independent testing by Artificial Analysis placed it at 53 on its Intelligence Index, below GPT-5.5 (60) and Gemini 3.1 Pro (57) but at a fraction of their price. xAI is training Grok 5, targeting a public beta in mid-2026.[^39]

### DeepSeek V4

[DeepSeek V4](/wiki/deepseek_v4) launched in preview in April 2026 with two MoE variants: DeepSeek-V4-Pro (1.6 trillion total parameters, 49 billion activated) and DeepSeek-V4-Flash (284 billion total parameters, 13 billion activated). Both support a 1 million token context window. The model pairs token-wise compression with DeepSeek Sparse Attention (DSA), which the developer states sharply reduces the compute and memory cost of long-context inference relative to DeepSeek-V3. V4-Pro leads open-weight models in world knowledge, mathematics, STEM, and coding benchmarks. Both variants were released under the permissive MIT license, and DeepSeek's open-weight strategy has positioned V4 as a challenger to closed frontier models at a fraction of their development cost, continuing the pattern established by DeepSeek-V2 and V3.[^37]
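
The gap between total and activated parameters is what makes these mixture-of-experts models cheaper to run than their headline size suggests. The sketch below computes the activated share from the parameter counts quoted in this article (the Mistral Large 3 figures come from the comparison table above); the helper name is hypothetical.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_params / total_params


# (total parameters, activated parameters), as quoted in this article.
moe_models = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
    "Mistral Large 3": (675e9, 41e9),
}

for name, (total, active) in moe_models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active")
```

On these figures V4-Pro activates only about 3 percent of its weights for any given token, so per-token compute tracks the 49 billion activated parameters rather than the 1.6 trillion total.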

### Muse Spark

Muse Spark, released by [Meta](/wiki/meta) on April 8, 2026, was the first model from Meta Superintelligence Labs (MSL), the division led by Chief AI Officer Alexandr Wang following Meta's reorganization of its AI research and Llama development teams. Muse Spark is a closed-source model and an architectural departure from the open-weight Llama family that defined Meta's earlier strategy, a shift that drew commentary because Meta had built much of its AI reputation on open releases. Meta reported that Muse Spark reaches its reasoning capabilities using more than an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship, and that it placed fourth on the Artificial Analysis Intelligence Index v4.0 behind Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6.[^41]

### Qwen3.7-Max

Qwen3.7-Max, released by [Alibaba](/wiki/alibaba) Cloud on May 19-20, 2026, is a proprietary, closed-weight reasoning model with a 1 million token context window and a native extended-thinking mode. Alibaba describes it as an agent-first model built for long-horizon work, citing an internal test in which the model sustained autonomous execution for up to 35 hours and performed more than 1,000 tool calls and iterative code edits. With a score of 56.6 on the Artificial Analysis Intelligence Index v4.0, it was the highest-ranked Chinese model on that index and entered the global top tier alongside Claude Opus 4.7 and GPT-5.5.[^43]

### Mistral Large 3

Mistral Large 3, released by [Mistral](/wiki/mistral) AI in December 2025, is the largest open-weight mixture-of-experts model from a major Western lab, with 675 billion total parameters and 41 billion active parameters, released under the Apache 2.0 license. It supports a 256,000 token context window, processes both text and images, and was trained with particular emphasis on non-English languages. Its open licensing and pricing of roughly $0.50 per million input tokens and $1.50 per million output tokens position it as a lower-cost alternative to closed frontier models for production workloads.[^44]
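To make the pricing concrete, the following sketch estimates spend for a workload using the approximate per-token rates quoted above. The workload size is a made-up example and the function name is illustrative; actual billing may differ.

```python
# Approximate list prices quoted in the text (USD per million tokens).
INPUT_USD_PER_M = 0.50
OUTPUT_USD_PER_M = 1.50


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given number of input and output tokens."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical monthly workload: 200M input tokens, 40M output tokens.
monthly = request_cost_usd(200_000_000, 40_000_000)
print(f"Estimated monthly cost: ${monthly:,.2f}")
```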

## How are frontier model capabilities evaluated?

### Overview

Evaluating frontier model capabilities is a rapidly evolving field. Standard academic benchmarks such as MMLU (massive multitask language understanding) and GSM8K (grade school mathematics) have been largely saturated by frontier models, prompting the development of harder evaluation suites. The principal evaluation frameworks in use as of 2026 include:

| Evaluation | Developer | Focus | Frontier model performance (2026) |
| --- | --- | --- | --- |
| [FrontierMath](/wiki/frontiermath) | Epoch AI | Expert-level mathematics | GPT-5.5 Pro: 52.4%; GPT-5.5: 51.7% |
| ARC-AGI-2 | ARC Prize | Novel logic pattern recognition | Gemini 3.1 Pro: 77.1%; frontier models generally 40-80% |
| SWE-bench Verified | Independent | Real-world software engineering | GPT-5: 74.9%; top frontier models 65-80% |
| METR Time Horizon | METR | Autonomous task completion duration | Claude Opus 4.6: ~14.5 hour tasks; GPT-5.2: ~6.6 hour tasks |
| GPQA Diamond | Independent | Graduate-level science (physics, chemistry, biology) | Frontier models 60-70%; human PhD experts ~65% |
| CyberGym | UC Berkeley | Cybersecurity capabilities | GPT-5.5: state-of-the-art |
| [MMLU](/wiki/mmlu) | Academic | Multitask language understanding (broad) | Frontier models: 86-92%; largely saturated |
| HealthBench Hard | OpenAI | Medical reasoning | GPT-5: 46.2% |

### METR time horizon methodology

Model Evaluation and Threat Research (METR) developed a "time horizon" methodology that measures the length of task, expressed as the time a human expert would need, that an AI agent can complete at a specified reliability threshold. The 50-percent time horizon is the task duration at which an agent succeeds half the time.

METR published its original time-horizon dataset in March 2025, showing that the frontier time horizon had doubled approximately every seven months between 2019 and 2025. Time Horizon 1.1, released in January 2026, expanded the task suite by 34 percent (from 170 to 228 tasks) and doubled the number of tasks lasting eight hours or more. As of early 2026, Claude Opus 4.6 leads at approximately 14.5 hours: at the 50-percent threshold, the model succeeds about half the time on tasks that would occupy a skilled human professional for well over a full working day.
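One way to read the definition: if success probability is modelled as a logistic function of the logarithm of human task time, the time horizon at a given reliability is the duration where that curve crosses the chosen threshold. The sketch below assumes such a curve has already been fitted. The coefficients are made-up placeholders rather than METR's values, and the function names are illustrative.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Logistic model of success vs. log2 of human task time (assumed form)."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task length in minutes at which modelled success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fitted coefficients; the slope is negative because
# longer tasks are harder for the agent.
INTERCEPT, SLOPE = 6.0, -0.75

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

A stricter reliability threshold always gives a shorter horizon under this model, so a headline figure is only meaningful alongside the threshold it was measured at.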

### UK AISI evaluation findings

The UK AI Security Institute's Frontier AI Trends Report (2025) drew on evaluations of more than 30 frontier models released between 2022 and October 2025. Key findings included:

- AI models can now complete apprentice-level cybersecurity tasks around 50 percent of the time, compared to just over 10 percent in early 2024.
- In 2025 the institute tested the first model able to complete expert-level cyber tasks that would typically require a human practitioner with more than ten years of experience.
- In a later evaluation, AISI tested GPT-5.5 (released in April 2026, after the report's cutoff) and found it to be the second model to solve one of the institute's multi-step cyber-attack simulations end-to-end.
- In late 2023, the most advanced models succeeded on software autonomy tasks lasting at least one hour in fewer than five percent of attempts. By mid-2025 the success rate exceeded 40 percent.

## Safety frameworks

### Anthropic Responsible Scaling Policy

Anthropic's Responsible Scaling Policy (RSP) is a self-imposed governance framework that ties the development and deployment of frontier models to demonstrated safety standards. Anthropic introduced the RSP in September 2023 as version 1.0, and has updated it several times since.

The RSP organizes models into AI Safety Levels (ASLs) analogous to biosafety levels:

- **ASL-1:** Systems that pose minimal risk even if used to cause harm, such as simple classifiers.
- **ASL-2:** Systems with early indicators of dangerous capabilities where current precautions are adequate. Most deployed Claude models fall here.
- **ASL-3:** Systems that could provide meaningful uplift to those seeking to create weapons capable of mass casualties, or which show signs of autonomous replication potential. Requires significantly enhanced security and deployment controls.
- **ASL-4:** Hypothetical systems approaching transformative autonomous capability, requiring controls that have not yet been fully specified.

The RSP requires Anthropic to pause deployment of a model that crosses into a higher ASL tier until adequate protections are in place. Critics have noted that the company itself determines when those protections are adequate, and some evaluators have argued the policy lacks external enforcement mechanisms.

Version 3.0 of the RSP, effective February 24, 2026, introduced several new accountability measures:

- **Risk Reports:** Detailed public documents explaining the safety profile of models, how capabilities, threat models, and active risk mitigations fit together, and an overall risk level assessment. Published online every three to six months.
- **Frontier Safety Roadmaps:** New requirement describing concrete plans for progress across security, alignment, safeguards, and policy.
- **Collective action framing:** The updated RSP explicitly acknowledges that the overall level of catastrophic risk from AI depends on the actions of multiple AI developers, not just Anthropic, and calls for coordinated industry norms.

### OpenAI Preparedness Framework

OpenAI published its Preparedness Framework in beta in December 2023 and updated it substantially in April 2025 (version 2). The framework is a structured process for tracking, evaluating, and preparing for catastrophic risks from frontier AI capabilities.

The framework evaluates models across four risk categories:

| Category | Description | Critical threshold example |
| --- | --- | --- |
| Cybersecurity | Ability to enable significant offensive cyber operations | End-to-end development of novel malware capable of critical infrastructure damage |
| CBRN | Uplift for chemical, biological, radiological, or nuclear weapons | Meaningful technical assistance with synthesis of biological agents capable of mass casualties |
| Persuasion | Ability to influence beliefs at societal scale | Campaigns indistinguishable from human-created content that measurably shift political opinion |
| Model autonomy | Self-replication, resource acquisition, and ability to undermine oversight | Ability to copy model weights, acquire compute, and take actions that prevent shutdown |

Models are assigned risk levels (low, medium, high, critical) based on evaluations in each category. OpenAI's policy states that models scoring "high" or above in any category require additional safeguards before deployment; models scoring "critical" in any category cannot be deployed. An internal Safety Advisory Group (SAG), a cross-functional group of OpenAI leaders, oversees the framework.
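The gating rule described above can be sketched as a small decision function. This is a simplified reading of the text, not OpenAI's implementation: the category keys and the outcome labels are illustrative names chosen for this example.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Category keys are illustrative spellings of the four tracked categories.
CATEGORIES = ("cybersecurity", "cbrn", "persuasion", "model_autonomy")


def deployment_decision(scores: dict) -> str:
    """Apply the rule as summarised in the text: the worst category decides."""
    missing = set(CATEGORIES) - set(scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    worst = max(scores[c] for c in CATEGORIES)
    if worst >= Risk.CRITICAL:
        return "do_not_deploy"
    if worst >= Risk.HIGH:
        return "additional_safeguards_required"
    return "standard_process"


example = {
    "cybersecurity": Risk.HIGH,
    "cbrn": Risk.MEDIUM,
    "persuasion": Risk.LOW,
    "model_autonomy": Risk.LOW,
}
print(deployment_decision(example))
```

Because the decision depends on the maximum across categories, a single "high" score is enough to trigger extra safeguards even when every other category is low.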

A provision in the framework allowing OpenAI to lower safety requirements if a competitor releases a system with "High" or "Critical" capability levels has drawn criticism from safety researchers, who argue it institutionalizes competitive pressure as grounds for relaxing safety standards.

Version 2 of the framework, released April 2025, sharpened the focus on specific risk categories, strengthened the definition of what it means to "sufficiently minimize" risk in practice, and added clearer operational guidance on evaluation, governance, and disclosure.

### Restricted release: the Claude Mythos case

Frontier safety frameworks can in principle lead a developer to withhold a model from general release. The clearest example to date is Claude Mythos, which [Anthropic](/wiki/anthropic) previewed on April 7, 2026. Anthropic described Mythos as a capability tier above [Claude Opus 4.7](/wiki/claude_opus_4_7) and reported benchmark results including 93.9 percent on SWE-bench Verified. Rather than releasing it broadly, Anthropic restricted access through an invitation-only partner program (referred to as Project Glasswing) for a small set of vetted organizations and critical-infrastructure operators, at partner pricing several times that of Opus 4.7.[^40]

The restriction was driven primarily by offensive-cybersecurity capability. Anthropic reported using Mythos to autonomously discover thousands of previously unknown ("zero-day") software vulnerabilities across major operating systems and web browsers, and the UK [AI Security Institute](/wiki/ai_safety_institute) reported that the model succeeded on expert-level hacking tasks at a rate that no AI system could achieve as recently as April 2025. These claims originate with Anthropic and the evaluating institute and had not, as of May 2026, been independently reproduced at large scale, but the episode is frequently cited as a concrete instance of a lab declining to release a frontier model on safety grounds.

## FrontierMath benchmark

[FrontierMath](/wiki/frontiermath) is a benchmark of hundreds of original mathematics problems created and vetted by expert mathematicians, released in November 2024 by Epoch AI. The benchmark is designed to measure the mathematical reasoning capabilities of frontier AI models in a way that is resistant to contamination from training data, because the problems are novel and not derived from existing published problem sets.

Problems are drawn from most major branches of modern mathematics, including number theory, real analysis, algebraic geometry, and category theory. The benchmark is organized into four difficulty tiers:

- **Tiers 1-3:** Undergraduate through early postdoctoral level problems.
- **Tier 4:** Research-level mathematics problems that might require a human specialist hours or days to solve.

Models submit answers as executable Python code, meaning scores reflect mathematical reasoning with access to computational tools rather than pen-and-paper derivations. Fields Medalists Terence Tao, Timothy Gowers, and Richard Borcherds, along with International Mathematical Olympiad coach Evan Chen, contributed to the benchmark's design and verification.
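To illustrate what answering in code means, here is a toy sketch. The problem is deliberately trivial and is not drawn from FrontierMath, and the exact-match `grade` function is an assumption about how automatic verification might work rather than Epoch AI's actual harness.

```python
def answer() -> int:
    """Toy problem (not from FrontierMath): sum of all primes below 1000."""
    limit = 1000
    sieve = [True] * limit
    sieve[0] = sieve[1] = False
    # Sieve of Eratosthenes: cross out multiples of each prime up to sqrt(limit).
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            for multiple in range(n * n, limit, n):
                sieve[multiple] = False
    return sum(i for i, is_prime in enumerate(sieve) if is_prime)


def grade(submission, reference: int) -> bool:
    """Exact-match check against a reference value (assumed grading rule)."""
    return submission() == reference


print(grade(answer, 76127))
```

The point of the format is that the model may compute its way to the final value, so a score reflects tool-assisted reasoning rather than a written proof.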

When FrontierMath was released in November 2024, state-of-the-art models solved under two percent of problems. By April 2026, performance had increased dramatically:

| Model | FrontierMath score (tiers 1-3) | Tier 4 score |
| --- | --- | --- |
| GPT-5.5 Pro | 52.4% | ~30% |
| GPT-5.5 | 51.7% | ~30% |
| GPT-5.4 Pro | 50.0% | ~28% |
| Claude Opus 4.6 | Over 40% | Over 30% |
| GPT-5.2 | Over 40% | Over 30% |

The rapid improvement in FrontierMath scores over roughly 18 months has been cited as evidence of fast-moving capability gains in mathematical reasoning, though some researchers note that improvements in post-training methods (extended chain-of-thought, tool use, and reinforcement learning on verifiable outcomes) may account for a larger share of these gains than scaling of pre-training compute alone.

## Risks and safety

### Technical and capability risks

| Risk category | Description | Examples | Mitigation approaches |
| --- | --- | --- | --- |
| Emergent capabilities | Unexpected abilities appearing at scale | In-context learning, tool use, deception | Comprehensive evaluation, capability discovery research[^9] |
| [Hallucination](/wiki/hallucination) | Generation of false or misleading information | Fabricated citations, incorrect facts | Improved training, retrieval augmentation, uncertainty quantification |
| Jailbreaking | Bypassing safety constraints | Harmful content generation, misuse instructions | Adversarial training, [constitutional classifiers](/wiki/constitutional_classifiers), robust safety layers |
| Loss of control | Models acting beyond intended parameters | Reward hacking, mesa-optimization | Alignment research, interpretability, shutdown mechanisms |

### Misuse risks

- **Cybersecurity threats:** Automated vulnerability discovery, sophisticated phishing, malware generation. The UK AISI found that multiple frontier models can now complete expert-level cyber tasks that previously required seasoned professionals.
- **Information warfare:** Large-scale disinformation campaigns, deepfakes, synthetic propaganda. Persuasion capabilities are one of the four categories tracked in OpenAI's Preparedness Framework.
- **CBRN risks:** Lowering barriers to chemical, biological, radiological, or nuclear weapon development. Both the RSP's ASL-3 threshold and OpenAI's Preparedness Framework treat bioweapon uplift as a critical risk category.[^30]
- **Privacy violations:** Personal data extraction, surveillance capabilities, targeted profiling at scale.

### Societal and structural risks

| Risk | Impact | Affected groups |
| --- | --- | --- |
| Labor displacement | Automation of cognitive work | Knowledge workers, creative professionals |
| Economic concentration | Market dominance by few companies | Smaller firms, developing nations |
| Bias amplification | Perpetuation of historical prejudices | Marginalized communities |
| Democratic erosion | Manipulation of public discourse | Citizens, democratic institutions |
| Environmental impact | Massive energy consumption for training and inference | Global climate, local communities[^31] |

## Resource requirements and development costs

### Computational requirements

| Aspect | Current scale (2026) | Projected (2028) |
| --- | --- | --- |
| Training compute | 10^25 - 10^27 FLOPs | 10^27 - 10^29 FLOPs |
| Training cost | $100M - $1B+ | $1B - $10B |
| Training duration | 3-6 months | 6-12 months |
| GPU requirements | 10,000 - 100,000+ GPUs | 200,000+ GPUs |
| Energy consumption | 50-500 GWh | 1,000+ GWh |
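For a sense of how these figures relate, training compute for a dense transformer is commonly approximated as 6 × parameters × training tokens. The sketch below applies that rule of thumb to a hypothetical model and converts the result into GPU-days. The model size, token count, per-GPU throughput, and utilization are all illustrative assumptions, not figures for any real system.

```python
EU_SYSTEMIC_RISK_FLOPS = 1e25  # EU AI Act presumption threshold


def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: C ~ 6 * N * D."""
    return 6.0 * params * tokens


def gpu_days(flops: float, peak_flops_per_gpu: float, utilization: float) -> float:
    """GPU-days needed at a given peak throughput and sustained utilization."""
    seconds = flops / (peak_flops_per_gpu * utilization)
    return seconds / 86_400


# Hypothetical run: 400B parameters trained on 15T tokens.
compute = training_flops(400e9, 15e12)

# Assumed hardware: ~1e15 FLOP/s peak per accelerator at 40% utilization.
days_total = gpu_days(compute, peak_flops_per_gpu=1e15, utilization=0.4)

print(f"Estimated compute: {compute:.1e} FLOPs")
print(f"Above EU threshold: {compute > EU_SYSTEMIC_RISK_FLOPS}")
print(f"Wall-clock on 20,000 GPUs: {days_total / 20_000:.0f} days")
```

For mixture-of-experts models the activated parameter count, not the total, is the relevant input to this approximation, which is one reason sparse models can sit below what their headline size suggests.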

### Infrastructure needs

- **Hardware:** Specialized GPUs (H100, H200, B200) or TPUs
- **Data centers:** Hyperscale facilities with advanced cooling
- **Networking:** High-bandwidth interconnects (InfiniBand)
- **Storage:** Petabyte-scale distributed systems
- **Software stack:** Custom training frameworks and optimization tools[^32]

## Comparison of current frontier models

The following table summarizes key specifications and benchmark positions for the leading frontier models as of May 2026:

| Model | Developer | Context window | Key benchmark strengths | Safety framework | Availability |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | [Anthropic](/wiki/anthropic) | 1M tokens | METR time horizon leader; agentic coding; effort controls | RSP v3.0 (ASL-2/3 threshold) | API, Bedrock, Vertex, Foundry |
| [GPT-5.5](/wiki/gpt-5-5) | [OpenAI](/wiki/openai) | 1M tokens | FrontierMath leader (Pro: 52.4%); AISI cyber evaluation | Preparedness Framework v2 | ChatGPT Plus/Pro; API |
| [Gemini 3.1 Pro](/wiki/gemini_3_pro) | [Google DeepMind](/wiki/google_deepmind) | 1M tokens | ARC-AGI-2: 77.1%; multimodal; finance agents | Google Frontier Safety Framework | Google AI Pro/Ultra; Vertex |
| [Grok 4.3](/wiki/grok_4) | [xAI](/wiki/xai) | 1M tokens | Real-time search; native video input; speed | xAI risk management framework | SuperGrok, Premium+; xAI API |
| Muse Spark | [Meta](/wiki/meta) | 262K tokens | Parallel reasoning; compute efficiency | Meta Frontier AI Framework | Meta AI; API |
| Qwen3.7-Max | [Alibaba](/wiki/alibaba) | 1M tokens | Long-horizon agent; top-ranked Chinese model (II 56.6) | Provider self-assessment | Alibaba Cloud Model Studio; API |
| [DeepSeek V4-Pro](/wiki/deepseek_v4) | [DeepSeek](/wiki/deepseek) | 1M tokens | Open-weight STEM and coding leader; MoE efficiency | Open weights; community auditable | Open weights (Hugging Face); API |
| Mistral Large 3 | [Mistral](/wiki/mistral) | 256K tokens | Largest open-weight MoE; multilingual; vision | Open weights; community auditable | Open weights (Hugging Face); API |

## Future outlook

### Technical developments

Projections from Epoch AI and other research organizations suggest that training compute for the most advanced frontier models may reach 10^28 FLOPs by 2027 or 2028, roughly three orders of magnitude above the EU AI Act's systemic risk threshold. As the number of models crossing the 10^25 FLOPs threshold grows from 24 at the end of 2024 toward over 100 by 2028, compute-threshold-based regulation faces pressure to recalibrate.

The METR time-horizon doubling trend, if it continues, implies that by the end of 2026 frontier agents could complete, at the 50-percent reliability threshold, tasks requiring over 30 hours of skilled human work. This would represent a qualitative shift from AI as a tool that assists humans on discrete tasks toward AI as a system that can autonomously manage extended multi-step projects.
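The projection follows from simple exponential extrapolation. A minimal sketch, assuming a 14.5-hour starting point, a constant seven-month doubling time, and roughly ten months between the early-2026 measurement and the end of the year (the month count is an assumption, since the source gives only "early 2026"):

```python
def projected_horizon(start_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a time horizon under a constant doubling time."""
    return start_hours * 2.0 ** (months_ahead / doubling_months)


START_HOURS = 14.5  # frontier 50-percent horizon quoted for early 2026

one_doubling = projected_horizon(START_HOURS, 7)
end_of_2026 = projected_horizon(START_HOURS, 10)  # assumed ~10 months ahead

print(f"After one doubling period: {one_doubling:.1f} hours")
print(f"End of 2026 (assumed): {end_of_2026:.1f} hours")
```

The result is sensitive to both the start date and the doubling time, so it is better read as an order-of-magnitude guide than as a forecast.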

xAI is training Grok 5, with a public beta expected in mid-2026. Anthropic, OpenAI, and Google are widely expected to release further frontier model iterations through 2026, and the pace of releases has been roughly one major frontier model per major lab per quarter since late 2025. The roster has also broadened beyond the US labs: open-weight models from DeepSeek and Mistral and proprietary models from Alibaba (Qwen) now appear in the global top tier, while Meta's pivot from open Llama releases to the closed Muse Spark illustrates that release strategy itself is becoming a competitive and governance variable.

### Research priorities

According to the International Scientific Report on Advanced AI Safety and related publications from major safety research organizations, the highest-priority research areas include:

- **Mechanistic interpretability:** Understanding what computations frontier models are actually performing, enabling verification of safety-relevant properties.
- **Scalable oversight:** Maintaining meaningful human supervision of systems whose outputs humans cannot always evaluate directly.
- **Adversarial robustness:** Ensuring that safety-relevant behaviors are stable under adversarial prompting and across deployment contexts.
- **Alignment research:** Developing formal methods to specify what behavior is wanted and verify that models pursue it.
- **Capability evaluation:** Building benchmark suites that remain challenging and informative as frontier capabilities continue to advance.

### Governance evolution

The EU AI Act's GPAI provisions took effect in August 2025 and established the first binding legal requirements for frontier model providers. As of May 2026, the European AI Office is developing more detailed technical standards in collaboration with industry. The AI Safety Institute International Network continues to expand, with member countries coordinating evaluation methodologies. The Frontier Model Forum's information-sharing agreement, signed in March 2025, represents an early attempt at industry-to-industry and industry-to-government intelligence sharing on frontier capabilities of concern.

The shift from purely voluntary commitments (the 2023 White House voluntary commitments) to binding regulation (the EU AI Act) within roughly two years reflects the speed at which frontier model governance has moved from aspiration to legal obligation. Further regulatory activity at the US federal level and in other major jurisdictions is widely anticipated, though its shape remains contested.

## See also

- [Foundation models](/wiki/foundation_models)
- [AI alignment](/wiki/ai_alignment)
- [AI Safety Institute](/wiki/ai_safety_institute)
- [FrontierMath](/wiki/frontiermath)
- [Constitutional classifiers](/wiki/constitutional_classifiers)
- [Transformer](/wiki/transformer)
- [Artificial general intelligence](/wiki/artificial_general_intelligence)
- [AI Safety Summit](/wiki/ai_safety_summit)
- [EU AI Act](/wiki/eu_ai_act)
- [Claude Opus 4.5](/wiki/claude_opus_4_5)
- [Claude Opus 4.7](/wiki/claude_opus_4_7)
- [GPT-5](/wiki/gpt-5)
- [GPT-5.5](/wiki/gpt-5-5)
- [Gemini 3 Pro](/wiki/gemini_3_pro)
- [DeepSeek V4](/wiki/deepseek_v4)
- [Grok 4](/wiki/grok_4)
- [Meta](/wiki/meta)
- [Alibaba](/wiki/alibaba)
- [Mistral](/wiki/mistral)
- [Large language model](/wiki/large_language_model)
- [Scaling laws](/wiki/scaling_laws)
- [AI safety](/wiki/ai_safety)

## References

[^1]: Suleyman, M. (2023). "The Coming Wave" and DeepMind commentary on the AI frontier. https://www.deepmind.com/ Accessed 2026-05-31.

[^2]: Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models." Stanford HAI / CRFM. https://arxiv.org/abs/2108.07258 Accessed 2026-05-31.

[^3]: Epoch AI. (2024). "Parameter, Compute and Data Trends in Machine Learning." https://epoch.ai/data/notable-ai-models Accessed 2026-05-31.

[^4]: Centre for the Governance of AI (GovAI). (2025). "Trends in Frontier AI Model Count: A Forecast to 2028." arXiv:2504.16138. https://arxiv.org/abs/2504.16138 Accessed 2026-05-31.

[^5]: Anderljung, M., et al. (2023). "Frontier AI Regulation: Managing Emerging Risks to Public Safety." arXiv:2307.03718. https://arxiv.org/abs/2307.03718 Accessed 2026-05-31.

[^6]: UK Department for Science, Innovation and Technology. (2023). "A pro-innovation approach to AI regulation: government response." https://www.gov.uk/government/consultations/ai-regulation-a-pro-innovation-approach-policy-proposals/outcome/a-pro-innovation-approach-to-ai-regulation-government-response Accessed 2026-05-31.

[^7]: UK Government / AI Safety Institute. (2023). "Capabilities and risks from frontier AI." Discussion paper. https://www.gov.uk/government/publications/frontier-ai-capabilities-and-risks-discussion-paper Accessed 2026-05-31.

[^8]: Frontier Model Forum. (2023). "What is the Frontier Model Forum?" https://www.frontiermodelforum.org/ Accessed 2026-05-31.

[^9]: Wei, J., et al. (2022). "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682 Accessed 2026-05-31.

[^10]: Anthropic. (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku." https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf Accessed 2026-05-31.

[^11]: Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models." Stanford CRFM. https://crfm.stanford.edu/report.html Accessed 2026-05-31.

[^12]: Silver, D., et al. (2017). "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm." DeepMind. arXiv:1712.01815. https://arxiv.org/abs/1712.01815 Accessed 2026-05-31.

[^13]: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774. https://arxiv.org/abs/2303.08774 Accessed 2026-05-31.

[^14]: Center for AI Safety. (2023). "Statement on AI Risk." https://www.safe.ai/work/statement-on-ai-risk Accessed 2026-05-31.

[^15]: Frontier Model Forum / OpenAI, Anthropic, Google, Microsoft. (2023). "Frontier Model Forum." https://openai.com/index/frontier-model-forum/ Accessed 2026-05-31.

[^16]: The White House. (2023). "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." October 30, 2023. https://bidenwhitehouse.archives.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/ Accessed 2026-05-31.

[^17]: UK Government. (2023). "The Bletchley Declaration by Countries Attending the AI Safety Summit." November 1-2, 2023. https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023 Accessed 2026-05-31.

[^18]: European Parliament and Council. (2024). "Regulation (EU) 2024/1689 (Artificial Intelligence Act)." https://eur-lex.europa.eu/eli/reg/2024/1689/oj Accessed 2026-05-31.

[^19]: European Commission. "AI Act Article 51: Classification of GPAI models as GPAI models with systemic risk." https://artificialintelligenceact.eu/article/51/ Accessed 2026-05-31.

[^20]: European Commission. (2025). "Guidelines for providers of general-purpose AI models." https://digital-strategy.ec.europa.eu/en/policies/guidelines-gpai-providers Accessed 2026-05-31.

[^21]: National Institute of Standards and Technology. (2023). "AI Risk Management Framework (AI RMF 1.0)." https://www.nist.gov/itl/ai-risk-management-framework Accessed 2026-05-31.

[^22]: US Department of Commerce / NIST. (2023-2024). "U.S. AI Safety Institute." https://www.nist.gov/aisi Accessed 2026-05-31.

[^23]: UK AI Security Institute. (2025). "Frontier AI Trends Report." https://www.aisi.gov.uk/ Accessed 2026-05-31.

[^24]: California Legislature. (2024). "SB 1047: Safe and Secure Innovation for Frontier Artificial Intelligence Models Act." https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240SB1047 Accessed 2026-05-31.

[^25]: California Legislature. (2025). "SB 53: Transparency in Frontier Artificial Intelligence Act." https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53 Accessed 2026-05-31.

[^26]: OECD. (2024). "Recommendation of the Council on Artificial Intelligence (updated OECD AI Principles)." https://oecd.ai/en/ai-principles Accessed 2026-05-31.

[^27]: Frontier Model Forum. (2024). "Amazon and Meta join the Frontier Model Forum." https://www.frontiermodelforum.org/updates/amazon-and-meta-join-the-frontier-model-forum/ Accessed 2026-05-31.

[^28]: Frontier Model Forum. (2023). "Announcing the Executive Director of the Frontier Model Forum and over $10M for a new AI Safety Fund." https://openai.com/index/frontier-model-forum-updates/ Accessed 2026-05-31.

[^29]: UK Government. (2024). "AI Seoul Summit 2024: International network of AI Safety Institutes." https://www.gov.uk/government/topical-events/ai-seoul-summit-2024 Accessed 2026-05-31.

[^30]: Anthropic. (2023). "Anthropic's Responsible Scaling Policy." https://www.anthropic.com/news/anthropics-responsible-scaling-policy Accessed 2026-05-31.

[^31]: Luccioni, S., et al. (2024). "Power Hungry Processing: Watts Driving the Cost of AI Deployment?" arXiv:2311.16863. https://arxiv.org/abs/2311.16863 Accessed 2026-05-31.

[^32]: Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. https://arxiv.org/abs/2001.08361 Accessed 2026-05-31.

[^33]: Bengio, Y., et al. (2025). "International AI Safety Report." https://www.gov.uk/government/publications/international-ai-safety-report-2025 Accessed 2026-05-31.

[^34]: OpenAI. (2026). "Introducing GPT-5.5." April 23, 2026. https://openai.com/index/introducing-gpt-5-5/ Accessed 2026-05-31.

[^35]: Anthropic. (2026). "Introducing Claude Opus 4.7." April 16, 2026. https://www.anthropic.com/news/claude-opus-4-7 Accessed 2026-05-31.

[^36]: Anthropic. (2026). "Claude Opus 4.8." Model page and release coverage. May 28, 2026. https://www.anthropic.com/claude/opus Accessed 2026-05-31.

[^37]: DeepSeek. (2026). "DeepSeek-V4 Preview." API documentation. April 24, 2026. https://api-docs.deepseek.com/news/news260424 Accessed 2026-05-31.

[^38]: Google. (2026). "Gemini 3.1 Pro: A smarter model for your most complex tasks." February 19, 2026. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/ Accessed 2026-05-31.

[^39]: xAI. (2026). "Grok 4.3." Release coverage and benchmark summaries (Artificial Analysis Intelligence Index). May 2026. https://x.ai/news Accessed 2026-05-31.

[^40]: Anthropic. (2026). "Claude Mythos Preview." red.anthropic.com. April 7, 2026. https://red.anthropic.com/2026/mythos-preview/ Accessed 2026-05-31.

[^41]: Meta. (2026). "Introducing Muse Spark: Scaling Towards Personal Superintelligence." Meta Superintelligence Labs. April 8, 2026. https://ai.meta.com/blog/introducing-muse-spark-msl/ Accessed 2026-05-31.

[^42]: European Commission. "Regulatory framework for AI (AI Act): timeline and enforcement of GPAI obligations." https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai Accessed 2026-05-31.

[^43]: Alibaba Cloud. (2026). "Qwen3.7-Max." Release coverage. May 19-20, 2026. https://technode.com/2026/05/21/alibaba-introduces-qwen3-7-max-as-next-gen-ai-agent-model/ Accessed 2026-05-31.

[^44]: Mistral AI. (2025). "Introducing Mistral 3." December 2025. https://mistral.ai/news/mistral-3/ Accessed 2026-05-31.