Frontier models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 8,085 words
Frontier models are the most advanced artificial intelligence models available at any given time, representing the cutting edge of AI capabilities. These highly capable foundation models, generally implemented as very large language models, are typically characterized by massive scale, multimodal capabilities, and the ability to perform a wide variety of complex tasks across different domains.[1][2] The term encompasses both foundation models and general-purpose AI systems that exceed the capabilities of existing models. Such systems often require significant computational resources to develop and deploy, with training compute typically exceeding 10^25 floating-point operations (FLOPs).[3][4]
The concept of frontier models was formally introduced in a 2023 paper titled "Frontier AI Regulation: Managing Emerging Risks to Public Safety," co-authored by researchers at OpenAI and other institutions, which defines them as "highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety."[5] The paper highlights three key challenges: dangerous capabilities can emerge unexpectedly during development or post-deployment; it is difficult to prevent misuse of deployed models; and model capabilities can proliferate rapidly, especially through open-source releases.
The United Kingdom government, a leading proponent of the term, defines frontier AI as "highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today's most advanced models."[6] The UK's analysis emphasizes that frontier models are increasingly multimodal and may be augmented into autonomous AI agents with tool use (for example web browsing and code execution) that expand their real-world impact.[7]
The Frontier Model Forum, an industry body established by leading AI companies, defines frontier models as "large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks."[8]
The phrase "frontier AI" gained currency through the UK government's policy work leading up to the November 2023 AI Safety Summit at Bletchley Park. Prior to that summit, UK officials used the term in consultations and white papers to distinguish the most capable general-purpose models from narrower AI systems. The term was simultaneously taken up by OpenAI and Anthropic in published AI safety research and company policy documents.
The word "frontier" is borrowed from the economics literature on production possibilities and technological frontiers, where it describes the outer boundary of what is technically achievable. Applied to AI, the frontier is not a fixed line but a moving one: a model that sits at the frontier today becomes the baseline against which future systems are measured. This dynamism is central to how governments and researchers use the term, because any fixed capability threshold will eventually be crossed by a wider range of systems as the field advances.
The Frontier Model Forum (FMF), established in July 2023 by Anthropic, Google, Microsoft, and OpenAI, gave the terminology institutional weight and helped anchor it across industry, policy, and academic discourse.
Frontier models typically exhibit several defining characteristics:
| Characteristic | Description | Implication |
|---|---|---|
| Massive scale and compute | Trained using vast computational resources, often exceeding 10^25 FLOPs, with parameter counts in the hundreds of billions or trillions, and training costs ranging from tens to hundreds of millions of dollars | High development costs create barriers to entry, concentrating power in well-funded labs. Scale is a primary driver of advanced capabilities |
| Multimodal capabilities | Ability to process and generate multiple types of data including text, images, audio, and video | Enables broad applicability across domains but complicates safety evaluation and control |
| Emergent properties | Display abilities that were not explicitly programmed, such as complex reasoning, chain-of-thought reasoning, code generation, or creative writing | Makes pre-deployment risk assessment challenging. The full capabilities may not be understood until widely deployed[9] |
| General-purpose functionality | Designed to be adaptable across a wide range of domains and tasks with minimal fine-tuning | Enormous utility but difficult to predict all possible uses, including malicious applications |
| Extended context windows | Modern frontier models can process extensive amounts of information, with some models supporting context windows of over 1 million tokens | Enables complex, long-form tasks but increases potential for sophisticated manipulation[10] |
| Autonomy potential | Advanced models show increasing ability to act autonomously to achieve goals, using tools, accessing external information, and self-correcting | Raises long-term safety concerns about control and alignment with human values |
The history of frontier models is intertwined with the evolution of foundation models, which emerged in the late 2010s. The term "foundation model" was coined in 2021 by researchers at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) to describe large-scale models trained on broad data using self-supervision.[11]
The first models trained on over 10^23 FLOPs were AlphaGo Master and AlphaGo Zero, developed by DeepMind and published in 2017.[12] The progression to current frontier models proceeded rapidly through successive generations of transformer-based language models.
| Date | Milestone | Significance |
|---|---|---|
| 2017 | AlphaGo Master/Zero | First models exceeding 10^23 FLOPs |
| 2018 | BERT (Google) | Demonstrated power of transformer-based language models |
| 2019 | GPT-2 (OpenAI) | Raised concerns about misuse, leading to staged release |
| 2020 | GPT-3 (OpenAI) | 175 billion parameters, sparked widespread interest in large language models |
| March 2023 | GPT-4 (OpenAI) | First model widely acknowledged as exceeding 10^25 FLOPs, demonstrated significant capability leap[13] |
| May 2023 | Joint AI Safety Statement | CEOs of major AI labs declared extinction risk from AI a global priority[14] |
| July 2023 | Frontier Model Forum launched | Industry coordination body established by Anthropic, Google, Microsoft, and OpenAI[15] |
| July 2023 | White House voluntary commitments | Seven leading AI companies committed to safety, red-teaming, and information-sharing with the US government |
| October 2023 | U.S. Executive Order 14110 | Established compute-based reporting requirements for frontier models[16] |
| November 2023 | Bletchley Declaration | 28 countries agreed on frontier AI risks and cooperation at the UK AI Safety Summit[17] |
| 2024 | EU AI Act finalized | Establishes regulations for "general-purpose AI models with systemic risk"[18] |
| May 2024 | AI Seoul Summit | Launched the AI Safety Institute International Network; Amazon and Meta joined the FMF |
| August 2025 | GPT-5 (OpenAI) | Major capability milestone; state-of-the-art across coding, math, and scientific reasoning |
| February 2026 | Anthropic RSP v3.0 | Substantially updated Responsible Scaling Policy with new transparency and accountability measures |
| April 2026 | Claude Mythos preview (Anthropic) | Announced April 7, 2026; restricted-release model a capability tier above Opus 4.7, withheld from general availability over offensive-cybersecurity risk[40] |
| April 2026 | Muse Spark (Meta) | Released April 8, 2026; first model from Meta Superintelligence Labs, a closed-source departure from the Llama family[41] |
| April 2026 | Claude Opus 4.7 (Anthropic) | Released April 16, 2026; high-resolution image support, 1M context, and task budgets[35] |
| April 2026 | GPT-5.5 (OpenAI) | Released April 23, 2026; improved agentic capabilities and scientific reasoning[34] |
| May 2026 | DeepSeek V4 / Qwen3.7-Max | Open-weight DeepSeek V4 (April 24) and Alibaba's Qwen3.7-Max (May 19-20) place Chinese labs in the global top tier[37][43] |
| May 2026 | Claude Opus 4.8 (Anthropic) | Released May 28, 2026; current Anthropic flagship, adding effort controls and a faster, cheaper mode[36] |
| August 2026 | EU AI Act enforcement | European Commission gains powers to fine GPAI providers (up to 3% of global turnover or EUR 15 million) from August 2, 2026[42] |
The Frontier Model Forum (FMF) was established in July 2023 by Anthropic, Google, Microsoft, and OpenAI as the primary industry coordination body for frontier AI safety. Amazon and Meta joined in May 2024, expanding the membership to six of the world's largest AI developers.
| Aspect | Details |
|---|---|
| Current members (2026) | Amazon, Anthropic, Google, Meta, Microsoft, OpenAI[27] |
| Key objectives | AI safety research, best practices, policy collaboration, societal applications |
| AI Safety Fund | $10 million for independent safety research, managed directly by FMF from June 2025 |
| Executive Director | Chris Meserole (appointed October 2023) |
| Focus areas | CBRN risks, cyber capabilities, societal impacts, evaluation standards |
In March 2025 the FMF announced a first-of-its-kind information-sharing agreement enabling member firms to exchange information about vulnerabilities, threats, and capabilities of concern unique to frontier AI. The Forum intends to pilot voluntary information-sharing with non-FMF frontier AI companies as well. In January 2026 the FMF published work on "Chain of Thought Monitorability," examining techniques for verifying the faithfulness of reasoning traces in large language models.
The FMF's AI Safety Fund has supported multiple cohorts of independent safety researchers. After the Meridian Institute announced in June 2025 that it would be winding down its operations, the FMF began managing the fund directly, issuing grants in areas including biosecurity, cybersecurity, AI agent evaluation, and synthetic content detection.
The most widely used quantitative marker for a frontier model is a training compute of 10^25 floating-point operations (FLOPs). The EU AI Act adopts this threshold, while US Executive Order 14110 set its main reporting threshold an order of magnitude higher, at 10^26 FLOPs. Regulators favor compute thresholds because training compute correlates approximately with the point at which models have shown the ability to acquire novel dangerous capabilities, though the correspondence is imperfect and contested.
The threshold captures several practical realities. Models trained at or above 10^25 FLOPs require specialized data center infrastructure, large GPU or TPU clusters, and training budgets typically exceeding $50 million. These resource requirements mean that relatively few organizations can train such systems, making compute a tractable lever for oversight. By contrast, models below this threshold can be reproduced by smaller research organizations or derived from open weights, making supply-side compute controls less effective for them.
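As a rough illustration of how a training run relates to this threshold, the sketch below estimates training compute with the common rule of thumb C ≈ 6·N·D (N parameters, D training tokens). That approximation is an assumption brought in for illustration and is not stated in this article; the parameter and token counts are hypothetical.

```python
FRONTIER_THRESHOLD_FLOPS = 1e25  # threshold discussed in the text


def estimate_training_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training compute as 6 * N * D.

    The 6*N*D rule of thumb is an assumption, not a regulatory formula.
    """
    return 6.0 * params * tokens


def meets_frontier_threshold(params: float, tokens: float) -> bool:
    """True if the estimated compute is at or above 10^25 FLOPs."""
    return estimate_training_flops(params, tokens) >= FRONTIER_THRESHOLD_FLOPS


# Hypothetical training runs, for illustration only.
small_run = estimate_training_flops(params=7e9, tokens=2e12)    # about 8.4e22
large_run = estimate_training_flops(params=1e12, tokens=1e13)   # about 6e25

print(f"small run: {small_run:.2e} FLOPs, frontier: {meets_frontier_threshold(7e9, 2e12)}")
print(f"large run: {large_run:.2e} FLOPs, frontier: {meets_frontier_threshold(1e12, 1e13)}")
```

An estimate like this is only a first pass; sparse architectures and post-training compute would need separate accounting.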
Research published in 2025 by the Centre for the Governance of AI (GovAI) estimated that approximately 23 models had exceeded 10^25 FLOPs by the end of 2024, with projections suggesting that between 103 and 306 models could cross this threshold by 2028. This rapid proliferation means that compute-threshold-based regulation will need to be recalibrated over time to remain meaningful.[4]
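The EU AI Act distinguishes two tiers of compute: a baseline of 10^23 FLOPs, above which general-purpose AI models face transparency and copyright obligations, and 10^25 FLOPs, above which a model is presumed to pose systemic risk.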
Researchers and policymakers have noted several limitations of relying exclusively on training compute as a threshold. Algorithmic improvements, more efficient architectures (such as mixture-of-experts), and longer post-training processes (including reinforcement learning from human feedback and reinforcement learning on verifiable outcomes) can produce highly capable models at lower pre-training compute costs. Reasoning training compute in particular has grown roughly tenfold every three to five months, far outpacing the four-to-five-fold annual growth in pre-training compute, and a raw sum of pre-training and post-training compute may become a progressively worse proxy for model capabilities as training recipes evolve.[32] This complicates the historical scaling-law relationship, in which capability gains tracked smoothly with increases in pre-training compute, parameters, and data.
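To compare the two growth rates quoted above on the same footing, the following sketch converts "tenfold every three to five months" into a per-year multiplier. It is plain compounding arithmetic on the figures in the text; the function name is illustrative.

```python
def annualized_growth(factor: float, period_months: float) -> float:
    """Convert growth of `factor` per `period_months` into a per-year multiplier."""
    return factor ** (12.0 / period_months)


fast = annualized_growth(10, 3)  # tenfold every 3 months -> 10**4 per year
slow = annualized_growth(10, 5)  # tenfold every 5 months -> 10**2.4 per year

print(f"reasoning training compute: roughly {slow:,.0f}x to {fast:,.0f}x per year")
print("pre-training compute: roughly 4x to 5x per year (figure from the text)")
```

Even at the slower end, the implied annual growth is two orders of magnitude, which is why a simple sum of pre-training and post-training compute can drift as a capability proxy.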
Some researchers advocate for capability-based thresholds instead of or alongside compute thresholds, measuring whether a model can perform specific dangerous tasks rather than how much compute was used to train it. This approach aligns with the UK AI Safety Institute's evaluation methodology and with the METR time-horizon framework.
The Bletchley Declaration, signed at the AI Safety Summit at Bletchley Park on November 1-2, 2023, was the first international agreement to formally define and address frontier AI risks. It was signed by 28 countries, including the United States and China as well as nations from Africa, the Middle East, and Asia, together with the European Union.
The Declaration defined frontier AI as "highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today's most advanced models." It noted that frontier AI presents substantial risks from "potential intentional misuse or unintended issues of control," with particular concern in the domains of cybersecurity and biotechnology.
Key commitments in the Declaration included:
On July 21, 2023, the Biden administration secured voluntary commitments from seven leading AI companies: Amazon, Anthropic, Google, Inflection AI, Meta, Microsoft, and OpenAI. A second set of commitments from eight additional companies including IBM, Nvidia, and Palantir followed in September 2023.
The commitments addressed safety practices considered appropriate for frontier models specifically. Companies committed to:
The EU AI Act, which entered into force in August 2024 and whose obligations for general-purpose AI models became applicable in August 2025, establishes the most detailed regulatory regime currently applicable to frontier models. It uses the term "general-purpose AI models with systemic risk" (GPAISR) rather than "frontier models" but covers largely the same set of systems.
| Aspect | Requirement | Details |
|---|---|---|
| Compute threshold | 10^25 FLOPs | Models exceeding this threshold are presumed to have systemic risk[19] |
| GPAI baseline threshold | 10^23 FLOPs | All GPAI models above this threshold face transparency and copyright obligations |
| Model evaluations | Mandatory testing | Standardized protocols and adversarial testing required |
| Risk assessment | Systemic risk evaluation | Must assess and mitigate potential societal-scale risks |
| Incident reporting | Report to AI Office | Serious incidents must be reported to the EU AI Office |
| Cybersecurity | Adequate protections | Ensure model and weight protection against misuse |
| Documentation | Technical documentation | Comprehensive documentation for downstream providers[20] |
| Compliance deadline | August 2, 2025 | Full implementation required for covered models; models on market before this date have until August 2027 |
| Notification | Within two weeks | Providers must notify the European Commission's AI Office within two weeks of crossing the 10^25 FLOPs threshold |
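The thresholds and notification window in the table can be read as a simple decision rule. The sketch below encodes them under that reading; the function names and tier labels are illustrative, and the real legal test involves more than a single compute number.

```python
from datetime import date, timedelta

GPAI_BASELINE_FLOPS = 1e23   # transparency and copyright obligations apply above this
SYSTEMIC_RISK_FLOPS = 1e25   # systemic risk is presumed above this
NOTIFICATION_WINDOW = timedelta(weeks=2)


def eu_tier(training_flops: float) -> str:
    """Map estimated training compute to the tier described in the table."""
    if training_flops > SYSTEMIC_RISK_FLOPS:
        return "GPAI with systemic risk"
    if training_flops > GPAI_BASELINE_FLOPS:
        return "GPAI"
    return "below GPAI baseline"


def notification_deadline(crossed_on: date) -> date:
    """Latest date to notify the AI Office after crossing 10^25 FLOPs."""
    return crossed_on + NOTIFICATION_WINDOW


# Hypothetical model and date, for illustration only.
print(eu_tier(3e25))
print(notification_deadline(date(2026, 1, 1)))
```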
On July 10, 2025 the European AI Office published the final General-Purpose AI Code of Practice, a voluntary instrument drafted by independent experts that gives providers a presumption-of-conformity route to meeting the transparency, copyright, and safety-and-security obligations. Alongside it, the Commission issued guidelines clarifying how the training compute threshold should be calculated and what documentation requirements apply to open-weight and open-source models. The obligations for GPAI providers became applicable on August 2, 2025. From August 2, 2026 the Commission gains powers to enforce them, including fines of up to 3 percent of global annual turnover or EUR 15 million, whichever is higher, under Article 101.[42] Separately, the Commission adopted a "Digital Omnibus" simplification package amending parts of the AI Act on November 19, 2025; a political agreement reached on May 7, 2026 extends, among other changes, the transition period for certain high-risk AI systems to August 2, 2028.
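The "whichever is higher" rule for the maximum fine is easy to misread, so a minimal sketch follows. It only computes the ceiling described above; the turnover figures are hypothetical and the function name is illustrative.

```python
FIXED_CAP_EUR = 15_000_000.0
TURNOVER_SHARE = 0.03  # 3 percent of global annual turnover


def max_gpai_fine_eur(global_annual_turnover_eur: float) -> float:
    """Upper bound on a fine: the higher of 3% of turnover or EUR 15 million."""
    return max(TURNOVER_SHARE * global_annual_turnover_eur, FIXED_CAP_EUR)


# Hypothetical providers, for illustration only.
for turnover in (100e6, 500e6, 10e9):
    print(f"turnover EUR {turnover:,.0f} -> ceiling EUR {max_gpai_fine_eur(turnover):,.0f}")
```

The fixed EUR 15 million figure dominates for any provider with turnover below EUR 500 million; above that, the percentage cap applies.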
The U.S. approach has centered on Executive Order 14110 on Safe, Secure, and Trustworthy AI, issued in October 2023. The Executive Order used a threshold of 10^26 FLOPs (one order of magnitude above the EU threshold) for the most intensive reporting requirements, while also establishing requirements for biological sequence models above 10^23 FLOPs due to biosecurity concerns.
| Component | Threshold | Requirements |
|---|---|---|
| Dual-use foundation models | More than 10^26 FLOPs | Report to government; share safety test results |
| Biological sequence models | More than 10^23 FLOPs (if primarily biological data) | Enhanced scrutiny for biosecurity risks |
| Red team testing | All covered models | Required before deployment |
| NIST AI RMF | Voluntary framework | Risk management guidance for AI lifecycle[21] |
| AI Safety Institute | Established 2023 | Develops standards and evaluation frameworks[22] |
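The two compute thresholds in the table can be sketched as a small check. This is an illustrative reading of the table only, with hypothetical function and label names; the Executive Order's actual definitions contain further conditions.

```python
DUAL_USE_THRESHOLD_FLOPS = 1e26
BIO_SEQUENCE_THRESHOLD_FLOPS = 1e23


def eo14110_reporting_triggers(
    training_flops: float, primarily_biological_data: bool = False
) -> list[str]:
    """Return which compute-based categories from the table a model falls into."""
    triggers = []
    if training_flops > DUAL_USE_THRESHOLD_FLOPS:
        triggers.append("dual-use foundation model")
    if primarily_biological_data and training_flops > BIO_SEQUENCE_THRESHOLD_FLOPS:
        triggers.append("biological sequence model")
    return triggers


# Hypothetical models, for illustration only.
print(eo14110_reporting_triggers(3e25))                                   # under 10^26
print(eo14110_reporting_triggers(2e26))
print(eo14110_reporting_triggers(5e23, primarily_biological_data=True))
```

Note that a model at 3 × 10^25 FLOPs falls under none of these categories even though it exceeds the 10^25 figure used elsewhere in this article.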
The US AI Safety Institute, housed within NIST, was established in late 2023 to develop evaluation frameworks and conduct pre-deployment testing of frontier models. Its creation was announced around the time of the Bletchley AI Safety Summit, and it has since contributed to the evaluation protocols applied to frontier models.
The UK created the AI Safety Institute (AISI) in November 2023, which was renamed the AI Security Institute in 2025. The institute focuses on capability-based evaluation rather than rigid compute thresholds, evaluating models across four broad categories:
By its 2025 year-in-review, the UK AISI's technical team had tested more than 30 of the world's most advanced models. The institute's publicly released Frontier AI Trends Report (2025) tracked capability changes across successive model generations, finding that AI models could complete apprentice-level cybersecurity tasks about 50 percent of the time by mid-2025, compared to just over 10 percent in early 2024. A model tested in 2025 was the first to successfully complete expert-level cyber tasks typically requiring over ten years of human professional experience.
The UK also led the creation of the AI Safety Institute International Network at the AI Seoul Summit in May 2024, connecting safety institutes across the US, UK, EU, Japan, Singapore, Canada, France, Kenya, Australia, South Korea, and other countries.
California has been active in US state-level frontier AI governance:
As of May 2026, the following models are considered to sit at or near the frontier of AI capabilities across major capability dimensions. The list mixes proprietary flagships from the largest US labs with open-weight challengers from DeepSeek, Alibaba, and Mistral, reflecting a roster that now spans several countries and both closed and open release strategies.
| Model | Developer | Initial release | Key capabilities | Context window | Release model |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | May 2026 | Hybrid reasoning, agentic coding, vision, effort controls, faster/cheaper mode | 1 million tokens | Proprietary[36] |
| GPT-5.5 | OpenAI | April 2026 | Agentic workflows, scientific research, code, multimodal (text/images/audio/video) | 1 million tokens | Proprietary[34] |
| Gemini 3.1 Pro | Google DeepMind | February 2026 | Natively multimodal, 77.1% on ARC-AGI-2, strong coding and finance agents | 1 million tokens | Proprietary[38] |
| Grok 4.3 | xAI | May 2026 | Real-time knowledge, reasoning, native video input, slide generation | 1 million tokens | Proprietary[39] |
| Muse Spark | Meta | April 2026 | First Meta Superintelligence Labs model; parallel reasoning; compute-efficient | 262,000 tokens | Proprietary[41] |
| Qwen3.7-Max | Alibaba | May 2026 | Long-horizon agent (1,000+ tool calls), reasoning; top-ranked Chinese model | 1 million tokens | Proprietary[43] |
| DeepSeek V4 | DeepSeek | April 2026 | Open-weight MoE, 1.6T parameters (49B activated in V4-Pro), top coding and STEM among open models | 1 million tokens | Open weights (MIT)[37] |
| Mistral Large 3 | Mistral | December 2025 | Largest open-weight MoE (675B total / 41B active); multilingual; vision | 256,000 tokens | Open weights (Apache 2.0)[44] |
Claude Opus 4.8, released on May 28, 2026, is Anthropic's most capable generally available model as of mid-2026. It is a hybrid reasoning model with a 1 million token context window and a 128,000 token maximum output. The release added user-facing effort controls that let callers decide how much computation Claude devotes to a response, a "dynamic workflow" feature that runs multiple subagents in parallel, and a faster mode that Anthropic states runs at roughly 2.5 times the speed and about one third of the cost of prior fast modes. Anthropic positioned it as improving on Opus 4.6 while fixing the comment-verbosity and tool-calling issues reported with Opus 4.7. Pricing is $5 per million input tokens and $25 per million output tokens, and the model is available on the Claude Platform, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.[36]
Claude Opus 4.7 was released on April 16, 2026. It introduced several new features: task budgets, which give the model a rough estimate of how many tokens to target for a full agentic loop including thinking, tool calls, and final output; high-resolution image support up to 2,576 pixels / 3.75 megapixels (increased from 1,568 pixels in prior versions); a new tokenizer; and a 128,000 token maximum output length. The model supports adaptive thinking and is available through the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6.
GPT-5.5 was released by OpenAI on April 23, 2026. The model processes text, images, audio, and video in a single unified architecture. OpenAI described it as matching GPT-5.4 per-token latency in real-world serving while performing at a substantially higher level of intelligence. OpenAI reported that GPT-5.5 scored 51.7 percent on FrontierMath tiers 1-3 and 35.4 percent on the tier 4 research-level problems, with the GPT-5.5 Pro configuration reaching 52.4 percent on tiers 1-3 to lead the leaderboard as of late April 2026. The model also scored 82.7 percent on Terminal-Bench 2.0, an agentic command-line benchmark. OpenAI released a system card documenting safety evaluations across cybersecurity, CBRN, and persuasion risk categories.[34] A streamlined variant, GPT-5.5 Instant, was released on May 5, 2026.
Gemini 3.1 Pro was released by Google DeepMind in preview on February 19, 2026. It is an iteration of the Gemini 3 Pro that Google released in late 2025. The model achieved a verified score of 77.1 percent on ARC-AGI-2, more than double the performance of Gemini 3 Pro. Gemini 3.1 Pro includes improved agentic behavior in domains such as finance and spreadsheet applications, and supports a 1 million token context window at $2 per million input tokens and $12 per million output tokens.[38]
Grok 4, the first model in the Grok 4 series, was released by xAI in July 2025, trained on a 200,000 GPU cluster with reinforcement learning at pretraining scale. Grok 4.20 launched in beta on February 17, 2026, introducing a 2 million token context window and a 16-agent "Heavy" system. Grok 4.3 entered beta on grok.com and the SuperGrok apps on April 17, 2026, reached the public API on April 30, and rolled out broadly during the week of May 4, 2026, adding native video input, presentation slide generation, and enhanced long-context processing with a 1 million token context window. xAI describes it as its most intelligent and fastest model; independent testing by Artificial Analysis placed it at 53 on its Intelligence Index, below GPT-5.5 (60) and Gemini 3.1 Pro (57) but at a fraction of their price. xAI is training Grok 5, targeting a public beta in mid-2026.[39]
DeepSeek V4 launched in preview in April 2026 with two MoE variants: DeepSeek-V4-Pro (1.6 trillion total parameters, 49 billion activated) and DeepSeek-V4-Flash (284 billion total parameters, 13 billion activated). Both support a 1 million token context window. The model pairs token-wise compression with DeepSeek Sparse Attention (DSA), which the developer states sharply reduces the compute and memory cost of long-context inference relative to DeepSeek-V3. V4-Pro leads open-weight models in world knowledge, mathematics, STEM, and coding benchmarks. Both variants were released under the permissive MIT license, and DeepSeek's open-weight strategy has positioned V4 as a challenger to closed frontier models at a fraction of their development cost, continuing the pattern established by DeepSeek-V2 and V3.[37]
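The gap between total and activated parameters is easier to see as a ratio. The sketch below computes the activated share for the two variants using the figures quoted above; treating that share as a proxy for per-token compute is a simplifying assumption, since attention and routing costs are ignored.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of parameters activated per token in a mixture-of-experts model."""
    return active_params / total_params


# Parameter counts as quoted in the text.
variants = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),    # about 3.1% active
    "DeepSeek-V4-Flash": (284e9, 13e9),   # about 4.6% active
}

for name, (total, active) in variants.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token")
```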
Muse Spark, released by Meta on April 8, 2026, was the first model from Meta Superintelligence Labs (MSL), the division led by Chief AI Officer Alexandr Wang following Meta's reorganization of its AI research and Llama development teams. Muse Spark is a closed-source model and an architectural departure from the open-weight Llama family that defined Meta's earlier strategy, a shift that drew commentary because Meta had built much of its AI reputation on open releases. Meta reported that Muse Spark reaches its reasoning capabilities using more than an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship, and that it placed fourth on the Artificial Analysis Intelligence Index v4.0 behind Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6.[41]
Qwen3.7-Max, released by Alibaba Cloud on May 19-20, 2026, is a proprietary, closed-weight reasoning model with a 1 million token context window and a native extended-thinking mode. Alibaba describes it as an agent-first model built for long-horizon work, citing an internal test in which the model sustained autonomous execution for up to 35 hours and performed more than 1,000 tool calls and iterative code edits. With a score of 56.6 on the Artificial Analysis Intelligence Index v4.0, it was the highest-ranked Chinese model on that index and entered the global top tier alongside Claude Opus 4.7 and GPT-5.5.[43]
Mistral Large 3, released by Mistral AI in December 2025, is the largest open-weight mixture-of-experts model from a major Western lab, with 675 billion total parameters and 41 billion active parameters, released under the Apache 2.0 license. It supports a 256,000 token context window, processes both text and images, and was trained with particular emphasis on non-English languages. Its open licensing and pricing of roughly $0.50 per million input tokens and $1.50 per million output tokens position it as a lower-cost alternative to closed frontier models for production workloads.[44]
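Per-token prices are hard to compare in the abstract, so the sketch below applies the list prices quoted in this section to a single request. The workload (200,000 input tokens, 10,000 output tokens) is hypothetical, the Mistral figures are the approximate ones given above, and real bills may differ because of caching, batching, or tiered pricing.

```python
# (input, output) prices in USD per million tokens, as quoted in the text.
PRICES_USD_PER_MILLION = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Mistral Large 3": (0.50, 1.50),  # approximate
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload, for illustration only.
for model in PRICES_USD_PER_MILLION:
    print(f"{model}: ${request_cost_usd(model, 200_000, 10_000):.3f}")
```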
Evaluating frontier model capabilities is a rapidly evolving field. Standard academic benchmarks such as MMLU (massive multitask language understanding) and GSM8K (grade school mathematics) have been largely saturated by frontier models, prompting the development of harder evaluation suites. The principal evaluation frameworks in use as of 2026 include:
| Evaluation | Developer | Focus | Frontier model performance (2026) |
|---|---|---|---|
| FrontierMath | Epoch AI | Expert-level mathematics | GPT-5.5 Pro: 52.4%; GPT-5.5: 51.7% |
| ARC-AGI-2 | ARC Prize | Novel logic pattern recognition | Gemini 3.1 Pro: 77.1%; frontier models generally 40-80% |
| SWE-bench Verified | Independent | Real-world software engineering | GPT-5: 74.9%; top frontier models 65-80% |
| METR Time Horizon | METR | Autonomous task completion duration | Claude Opus 4.6: ~14.5 hour tasks; GPT-5.2: ~6.6 hour tasks |
| GPQA Diamond | Independent | Graduate-level science (physics, chemistry, biology) | Frontier models 60-70%; human PhD experts ~65% |
| CyberGym | UC Berkeley | Cybersecurity capabilities | GPT-5.5: state-of-the-art |
| MMLU | Academic | Multitask language understanding (broad) | Frontier models: 86-92%; largely saturated |
| HealthBench Hard | OpenAI | Medical reasoning | GPT-5: 46.2% |
Model Evaluation and Threat Research (METR) developed a "time horizon" methodology that expresses an AI agent's capability as a task duration: the length of task, measured by how long a human expert would take, that the agent can complete at a specified reliability threshold. The 50-percent time horizon is the task duration at which an agent succeeds half the time.
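One way to make the definition concrete is to assume that success probability falls off logistically with the logarithm of task duration, then solve for the duration at a chosen reliability. The logistic form, the coefficients, and the function names below are assumptions for illustration and are not taken from METR's published code.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: P(success) = sigmoid(intercept - slope * log2(task_minutes))."""
    return 1.0 / (1.0 + math.exp(-(intercept - slope * math.log2(task_minutes))))


def time_horizon_minutes(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task duration at which the assumed model predicts `reliability` success."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical fitted coefficients, for illustration only.
intercept, slope = 6.0, 1.0
h50 = time_horizon_minutes(intercept, slope, reliability=0.5)
h80 = time_horizon_minutes(intercept, slope, reliability=0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this assumed model a stricter reliability threshold always yields a shorter horizon, which is why the reliability level must be quoted alongside any time-horizon figure.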
METR published its original time-horizon dataset in March 2025, showing that the frontier time horizon doubled approximately every seven months over the period 2019 to 2025. Time Horizon 1.1, released in January 2026, expanded the task suite by 34 percent (from 170 to 228 tasks) and doubled the number of tasks lasting eight hours or more. As of early 2026, Claude Opus 4.6 leads at approximately 14.5 hours, meaning the model succeeds about half the time on tasks that would occupy a skilled human professional for well over a full working day.
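The seven-month doubling figure implies simple exponential growth, sketched below. This is an extrapolation under the assumption that the historical trend continues unchanged, which the data do not guarantee; the function names are illustrative.

```python
import math


def projected_horizon_hours(
    current_hours: float, months_ahead: float, doubling_months: float = 7.0
) -> float:
    """Extrapolate a time horizon assuming a constant doubling period."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)


def months_until(current_hours: float, target_hours: float, doubling_months: float = 7.0) -> float:
    """Months needed to reach `target_hours` under the same assumption."""
    return doubling_months * math.log2(target_hours / current_hours)


# Starting point quoted in the text: roughly 14.5 hours in early 2026.
print(projected_horizon_hours(14.5, months_ahead=7))    # one doubling
print(projected_horizon_hours(14.5, months_ahead=14))   # two doublings
print(months_until(14.5, target_hours=40.0))            # a 40-hour work week
```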
The UK AI Security Institute's Frontier AI Trends Report (2025) drew on evaluations of more than 30 frontier models released between 2022 and October 2025. Key findings included:
Anthropic's Responsible Scaling Policy (RSP) is a self-imposed governance framework that ties the development and deployment of frontier models to demonstrated safety standards. Anthropic introduced the RSP in September 2023 as version 1.0, and has updated it several times since.
The RSP organizes models into AI Safety Levels (ASLs) analogous to biosafety levels:
The RSP requires Anthropic to pause deployment of a model that crosses into a higher ASL tier until adequate protections are in place. Critics have noted that the company itself determines when those protections are adequate, and some evaluators have argued the policy lacks external enforcement mechanisms.
Version 3.0 of the RSP, effective February 24, 2026, introduced several new accountability measures:
OpenAI published its Preparedness Framework in beta in December 2023 and updated it substantially in April 2025 (version 2). The framework is a structured process for tracking, evaluating, and preparing for catastrophic risks from frontier AI capabilities.
The framework evaluates models across four risk categories:
| Category | Description | Critical threshold example |
|---|---|---|
| Cybersecurity | Ability to enable significant offensive cyber operations | End-to-end development of novel malware capable of critical infrastructure damage |
| CBRN | Uplift for chemical, biological, radiological, or nuclear weapons | Meaningful technical assistance with synthesis of biological agents capable of mass casualties |
| Persuasion | Ability to influence beliefs at societal scale | Campaigns indistinguishable from human-created content that measurably shift political opinion |
| Model autonomy | Self-replication, resource acquisition, and ability to undermine oversight | Ability to copy model weights, acquire compute, and take actions that prevent shutdown |
Models are assigned risk levels (low, medium, high, critical) based on evaluations in each category. OpenAI's policy states that models scoring "high" or above in any category require additional safeguards before deployment; models scoring "critical" in any category cannot be deployed. An internal Safety Advisory Group (SAG), a cross-functional group of OpenAI leaders, oversees the framework.
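The deployment rule described above amounts to taking the worst score across categories. The sketch below encodes that reading; the enum, function, and return strings are illustrative names, and the example scores are hypothetical rather than drawn from any published evaluation.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def deployment_decision(scores: dict[str, RiskLevel]) -> str:
    """Apply the rule from the text to per-category risk scores."""
    worst = max(scores.values())
    if worst >= RiskLevel.CRITICAL:
        return "do not deploy"
    if worst >= RiskLevel.HIGH:
        return "deploy only with additional safeguards"
    return "no additional safeguards required by this rule"


# Hypothetical scores, for illustration only.
example = {
    "cybersecurity": RiskLevel.HIGH,
    "cbrn": RiskLevel.MEDIUM,
    "persuasion": RiskLevel.LOW,
    "model_autonomy": RiskLevel.LOW,
}
print(deployment_decision(example))
```

Because the decision depends only on the maximum, a single high score in one category is enough to trigger safeguards regardless of the others.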
A provision in the framework allowing OpenAI to lower safety requirements if a competitor releases a system with "High" or "Critical" capability levels has drawn criticism from safety researchers, who argue it institutionalizes competitive pressure as grounds for relaxing safety standards.
Version 2 of the framework, released April 2025, sharpened the focus on specific risk categories, strengthened the definition of what it means to "sufficiently minimize" risk in practice, and added clearer operational guidance on evaluation, governance, and disclosure.
Frontier safety frameworks can in principle lead a developer to withhold a model from general release. The clearest example to date is Claude Mythos, which Anthropic previewed on April 7, 2026. Anthropic described Mythos as a capability tier above Claude Opus 4.7 and reported benchmark results including 93.9 percent on SWE-bench Verified. Rather than releasing it broadly, Anthropic restricted access through an invitation-only partner program (referred to as Project Glasswing) for a small set of vetted organizations and critical-infrastructure operators, at partner pricing several times that of Opus 4.7.[40]
The restriction was driven primarily by offensive-cybersecurity capability. Anthropic reported using Mythos to autonomously discover thousands of previously unknown ("zero-day") software vulnerabilities across major operating systems and web browsers. The UK AI Security Institute reported that the model succeeded on expert-level hacking tasks at a rate that no AI system had achieved as recently as April 2025. These claims originate with Anthropic and the evaluating institute and, as of May 2026, had not been independently reproduced at large scale. Even so, the episode is frequently cited as a concrete instance of a lab withholding a frontier model from general release on safety grounds.
FrontierMath is a benchmark of hundreds of original mathematics problems created and vetted by expert mathematicians, released in November 2024 by Epoch AI. The benchmark is designed to measure the mathematical reasoning capabilities of frontier AI models in a way that is resistant to contamination from training data, because the problems are novel and not derived from existing published problem sets.
Problems are drawn from most major branches of modern mathematics, including number theory, real analysis, algebraic geometry, and category theory. The benchmark is organized into four difficulty tiers:
Models submit answers as executable Python code, meaning scores reflect mathematical reasoning with access to computational tools rather than pen-and-paper derivations. Fields Medalists Terence Tao, Timothy Gowers, and Richard Borcherds, along with International Mathematical Olympiad coach Evan Chen, contributed to the benchmark's design and verification.
When FrontierMath was released in November 2024, state-of-the-art models solved under two percent of problems. By April 2026, performance had increased dramatically:
| Model | FrontierMath score (tiers 1-3) | Tier 4 score |
|---|---|---|
| GPT-5.5 Pro | 52.4% | ~30% |
| GPT-5.5 | 51.7% | ~30% |
| GPT-5.4 Pro | 50.0% | ~28% |
| Claude Opus 4.6 | Over 40% | Over 30% |
| GPT-5.2 | Over 40% | Over 30% |
The rapid improvement in FrontierMath scores over roughly 18 months has been cited as evidence of fast-moving capability gains in mathematical reasoning, though some researchers note that improvements in post-training methods (extended chain-of-thought, tool use, and reinforcement learning on verifiable outcomes) may account for a larger share of these gains than scaling of pre-training compute alone.
| Risk category | Description | Examples | Mitigation approaches |
|---|---|---|---|
| Emergent capabilities | Unexpected abilities appearing at scale | In-context learning, tool use, deception | Comprehensive evaluation, capability discovery research[9] |
| Hallucination | Generation of false or misleading information | Fabricated citations, incorrect facts | Improved training, retrieval augmentation, uncertainty quantification |
| Jailbreaking | Bypassing safety constraints | Harmful content generation, misuse instructions | Adversarial training, constitutional classifiers, robust safety layers |
| Loss of control | Models acting beyond intended parameters | Reward hacking, mesa-optimization | Alignment research, interpretability, shutdown mechanisms |
| Risk | Impact | Affected groups |
|---|---|---|
| Labor displacement | Automation of cognitive work | Knowledge workers, creative professionals |
| Economic concentration | Market dominance by few companies | Smaller firms, developing nations |
| Bias amplification | Perpetuation of historical prejudices | Marginalized communities |
| Democratic erosion | Manipulation of public discourse | Citizens, democratic institutions |
| Environmental impact | Massive energy consumption for training and inference | Global climate, local communities[31] |
| Aspect | Current scale (2026) | Projected (2028) |
|---|---|---|
| Training compute | 10^25 - 10^27 FLOPs | 10^27 - 10^29 FLOPs |
| Training cost | $100M - $1B+ | $1B - $10B |
| Training duration | 3-6 months | 6-12 months |
| GPU requirements | 10,000 - 100,000+ GPUs | 200,000+ GPUs |
| Energy consumption | 50-500 GWh | 1,000+ GWh |
The following table summarizes key specifications and benchmark positions for the leading frontier models as of May 2026:
| Model | Developer | Context window | Key benchmark strengths | Safety framework | Availability |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 1M tokens | METR time horizon leader; agentic coding; effort controls | RSP v3.0 (ASL-2/3 threshold) | API, Bedrock, Vertex, Foundry |
| GPT-5.5 | OpenAI | 1M tokens | FrontierMath leader (Pro: 52.4%); AISI cyber evaluation | Preparedness Framework v2 | ChatGPT Plus/Pro; API |
| Gemini 3.1 Pro | Google DeepMind | 1M tokens | ARC-AGI-2: 77.1%; multimodal; finance agents | Google Frontier Safety Framework | Google AI Pro/Ultra; Vertex |
| Grok 4.3 | xAI | 1M tokens | Real-time search; native video input; speed | xAI risk management framework | SuperGrok, Premium+; xAI API |
| Muse Spark | Meta | 262K tokens | Parallel reasoning; compute efficiency | Meta Frontier AI Framework | Meta AI; API |
| Qwen3.7-Max | Alibaba | 1M tokens | Long-horizon agent; top-ranked Chinese model (II 56.6) | Provider self-assessment | Alibaba Cloud Model Studio; API |
| DeepSeek V4-Pro | DeepSeek | 1M tokens | Open-weight STEM and coding leader; MoE efficiency | Open weights; community auditable | Open weights (Hugging Face); API |
| Mistral Large 3 | Mistral | 256K tokens | Largest open-weight MoE; multilingual; vision | Open weights; community auditable | Open weights (Hugging Face); API |
Projections from Epoch AI and other research organizations suggest that training compute for the most advanced frontier models may reach 10^28 FLOPs by 2027 or 2028, roughly three orders of magnitude above the EU AI Act's systemic risk threshold of 10^25 FLOPs. As the number of models crossing that threshold grows from roughly 23 in 2024 toward over 100 by 2028, compute-threshold-based regulation faces pressure to recalibrate.
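The "orders of magnitude" comparison is a base-10 logarithm of the ratio between projected compute and the regulatory threshold. The snippet below is a minimal sketch of that arithmetic using the two figures cited in the text; the constant and function names are illustrative and not drawn from any regulation or library.

```python
import math

EU_SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25  # threshold figure cited in the text
PROJECTED_FRONTIER_FLOPS = 1e28          # projection cited in the text


def orders_of_magnitude_above(compute_flops: float, threshold_flops: float) -> float:
    """Return how many powers of ten compute_flops sits above threshold_flops."""
    return math.log10(compute_flops / threshold_flops)


gap = orders_of_magnitude_above(PROJECTED_FRONTIER_FLOPS, EU_SYSTEMIC_RISK_THRESHOLD_FLOPS)
print(f"{gap:.1f} orders of magnitude above the threshold")
```

A result of about 3 means the projected training run would use roughly 1,000 times the compute at which the threshold is set.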
The METR time-horizon doubling trend, if it continues, implies that by the end of 2026 frontier agents could reliably complete tasks requiring over 30 hours of skilled human work. This would represent a qualitative shift from AI as a tool that assists humans on discrete tasks toward AI as a system that can autonomously manage extended multi-step projects.
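A doubling trend of this kind is a simple exponential extrapolation, which the sketch below makes explicit. The starting horizon and doubling period used in the example are placeholder values chosen for round arithmetic; they are assumptions for illustration, not METR measurements, and the function names are hypothetical.

```python
import math


def projected_horizon_hours(start_hours: float, doubling_months: float,
                            elapsed_months: float) -> float:
    """Extrapolate a task time horizon that doubles every doubling_months."""
    return start_hours * 2 ** (elapsed_months / doubling_months)


def months_to_reach(target_hours: float, start_hours: float,
                    doubling_months: float) -> float:
    """Invert the extrapolation: months until the horizon reaches target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


# Placeholder inputs (assumptions, not measured values):
# a 4-hour horizon today that doubles every 4 months.
print(projected_horizon_hours(4.0, 4.0, 12.0))  # 32.0 (three doublings in 12 months)
print(round(months_to_reach(30.0, 4.0, 4.0), 1))
```

Because the projection is exponential, it is highly sensitive to the assumed doubling period: small changes in that parameter shift the date at which a given horizon is reached by many months.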
xAI is training Grok 5, with a public beta expected in mid-2026. Anthropic, OpenAI, and Google are widely expected to release further frontier model iterations through 2026, and the pace of releases has been roughly one major frontier model per major lab per quarter since late 2025. The roster has also broadened beyond the US labs: open-weight models from DeepSeek and Mistral and proprietary models from Alibaba (Qwen) now appear in the global top tier, while Meta's pivot from open Llama releases to the closed Muse Spark illustrates that release strategy itself is becoming a competitive and governance variable.
According to the International Scientific Report on the Safety of Advanced AI and related publications from major safety research organizations, the highest-priority research areas include:
The EU AI Act's general-purpose AI (GPAI) provisions took effect in August 2025 and established the first binding legal requirements for frontier model providers. As of May 2026, the European AI Office is developing more detailed technical standards in collaboration with industry. The International Network of AI Safety Institutes continues to expand, with member countries coordinating evaluation methodologies. The Frontier Model Forum's information-sharing agreement, signed in March 2025, represents an early attempt at industry-to-industry and industry-to-government information sharing on frontier capabilities of concern.
The shift from purely voluntary commitments (the 2023 White House voluntary commitments) to binding regulation (the EU AI Act) within roughly two years reflects the speed at which frontier model governance has moved from aspiration to legal obligation. Further regulatory activity at the US federal level and in other major jurisdictions is widely anticipated, though its shape remains contested.