# Foundation models

> Source: https://aiwiki.ai/wiki/foundation_models
> Updated: 2026-06-20
> Categories: AI Models, Artificial Intelligence, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Large language model](/wiki/large_language_model), [Self-supervised learning](/wiki/self-supervised_learning), [Transfer learning](/wiki/transfer_learning), [Generative artificial intelligence](/wiki/generative_ai)*

## What is a foundation model?

A **foundation model** is a large [machine learning](/wiki/machine_learning) model trained on broad data, generally using [self-supervised learning](/wiki/self-supervised_learning) at scale, that can be adapted to a wide range of downstream tasks.[1] The term was coined in August 2021 by researchers at Stanford's Center for Research on Foundation Models (CRFM), who chose it, in their words, "to underscore their critically central yet incomplete character."[1] The defining technical property is that a single pretrained model can serve many tasks after light adaptation, rather than each task requiring a separately trained model from scratch.[1]

The naming came from a 212-page report written by 114 authors that opened by declaring: "AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks."[1] The report described a class of models, including [BERT](/wiki/bert), [GPT-3](/wiki/gpt-3), [CLIP](/wiki/clip), [DALL-E](/wiki/dall-e), and [AlphaFold](/wiki/alphafold), that play a foundational role in many downstream applications.

Foundation models are now central to almost every major AI system in production. [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), [Gemini](/wiki/gemini), [LLaMA](/wiki/llama), [Mistral](/wiki/mistral), [Stable Diffusion](/wiki/stable_diffusion), [Whisper](/wiki/whisper), and [Sora](/wiki/sora) are all foundation models in this sense. They are pretrained once at very high cost (tens to hundreds of millions of dollars in compute for the largest models) and then deployed across countless products, from chatbots and search engines to medical imaging and protein design.[2]

The name was chosen as both a description and an argument: foundation models are the base on which much of modern AI is now built, and concentrating so much capability in a small number of pretrained checkpoints has consequences that go beyond any single application. The report argued that this paradigm shift introduces both new opportunities (broad capability, transfer learning, [emergence](/wiki/emergent_abilities)) and new risks (homogenization, single points of failure, bias amplification, environmental impact, regulatory blind spots).[1]

## Definition

The most widely cited definition comes from the original Stanford report: "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks."[1] The definition has three load-bearing parts:

1. **Trained on broad data.** The training corpus is general rather than task-specific. For language models this typically means a large fraction of the internet plus books, code repositories, and other heterogeneous text. For vision models it means hundreds of millions of image-text pairs scraped from the web.
2. **Self-supervision at scale.** The model is trained to predict parts of its input from other parts (next-token prediction, masked-token prediction, contrastive prediction across modalities), which removes the need for human-labeled training data and makes web-scale training feasible.
3. **Adaptable to many downstream tasks.** A single trained checkpoint can be specialized via [fine-tuning](/wiki/fine-tuning), instruction tuning, [prompt engineering](/wiki/prompt_engineering), [LoRA](/wiki/lora), retrieval, or in-context learning, rather than requiring a fresh model for each new application.

The Stanford authors explicitly distinguished the term from earlier labels they considered too narrow. "Large language model" was rejected because foundation models are not only about language; they include vision, speech, code, and protein structure. "Self-supervised model" was considered too tied to a particular training objective. "Pretrained model" was rejected because the interesting behaviors emerge after pretraining, not during it. They also chose "foundation" over "foundational" to avoid implying that these models embody fundamental scientific principles, when in fact they are more like load-bearing infrastructure.[1]

Not every large neural network is a foundation model. A model trained only for image classification on ImageNet, even if it has hundreds of millions of parameters, is not a foundation model in the Stanford sense because it was trained on a narrow dataset for a single task. The defining feature is breadth, both in training data and in downstream applicability.

### How do foundation models differ from large language models?

Foundation models and [large language models](/wiki/large_language_model) overlap heavily but are not the same. All LLMs that are trained on broad data and then adapted to many downstream tasks (which is almost all of them today) are foundation models, but not all foundation models are LLMs. CLIP is a foundation model that processes images and text. AlphaFold is a foundation model for protein structures. [Whisper](/wiki/whisper) is an audio foundation model. "Foundation model" is the broader category; LLMs are the language-focused subset that has dominated public attention since [ChatGPT](/wiki/chatgpt).

### How do foundation models differ from frontier models?

The terms "foundation model" and "[frontier model](/wiki/frontier_models)" are related but not interchangeable. Foundation model names a paradigm: any model trained on broad data and adaptable to many tasks, regardless of how capable it is. Frontier model names a moving subset: the small number of the most capable general-purpose models at the cutting edge of the field at any given moment, the ones whose training compute and capabilities exceed everything that came before. Every frontier model (for example GPT-5, Gemini 3, or Claude Opus 4.5) is a foundation model, but the great majority of foundation models, including older or smaller systems like BERT, the original CLIP, or a 7-billion-parameter open model, are not frontier models. Policy frameworks tend to single out the frontier subset for the heaviest obligations, because that is where the largest capability jumps and the most acute safety questions concentrate.

## History

### Origins of the paradigm

The technical ingredients of foundation models predate the term by several years. Self-supervised pretraining followed by task-specific fine-tuning was already common in NLP by 2018, with [Word2vec](/wiki/word2vec) (2013) and [GloVe](/wiki/glove) (2014) representing earlier static-embedding versions of the same idea. The [Transformer](/wiki/transformer) architecture introduced by Vaswani et al. in 2017 made it practical to scale these methods, and 2018 saw the release of OpenAI's [GPT](/wiki/generative_pre-trained_transformer) and Google's [BERT](/wiki/bert), both of which pretrained large transformer models on broad text corpora and adapted them to dozens of downstream tasks.[3][4]

What changed in 2020 was scale. OpenAI's [GPT-3](/wiki/gpt-3), with 175 billion parameters, demonstrated that sufficiently large pretrained models could perform many tasks zero-shot or with only a handful of examples in the prompt, with no fine-tuning at all. In January 2021, [CLIP](/wiki/clip) and [DALL-E](/wiki/dall-e) showed that the same paradigm worked across modalities, and DeepMind's [AlphaFold 2](/wiki/alphafold), unveiled at CASP14 in late 2020 and published in 2021, showed that it worked for scientific problems like protein structure prediction.[5]

### Who coined the term, and when?

In August 2021, more than 100 researchers from Stanford's Center for Research on Foundation Models published a 212-page report titled "On the Opportunities and Risks of Foundation Models" (arXiv:2108.07258), submitted on 16 August 2021. The report was led by Rishi Bommasani and Drew A. Hudson, with Percy Liang as senior author and director of CRFM. In total it listed 114 authors (Bommasani plus 113 others) drawn from computer science, linguistics, law, education, philosophy, medicine, and the social sciences.[1]

The report did three things at once. It named a phenomenon that researchers had been talking around ("large pretrained models") with a single term. It catalogued what those models could do across modalities, tasks, and disciplines. And it raised a long list of concerns about the social, economic, and scientific consequences of building so much downstream capability on top of a small number of opaque pretrained checkpoints.[1]

The choice of name was immediately controversial. Some researchers felt that "foundation model" was a rebrand of "large language model" intended to give Stanford a flag to plant in the ground. Emily Bender, a linguist at the University of Washington and a vocal critic of large language models, argued that the term implied a stability and reliability the underlying models had not earned.[6] Others objected that the term obscured the fact that these systems were being deployed before their failure modes were understood. Despite the criticism, the term stuck. Within two years it was used routinely in academic papers, industry product launches, and government policy documents.

### Mainstreaming (2022 to 2024)

The November 2022 release of [ChatGPT](/wiki/chatgpt), built on OpenAI's GPT-3.5 foundation model, brought foundation models into mainstream public awareness. The August 2022 release of [Stable Diffusion](/wiki/stable_diffusion) did the same for image generation. By 2023, foundation models were the explicit subject of regulatory attention from the European Union, the United States, the United Kingdom, China, and other jurisdictions.[7]

The 2023 to 2024 period also saw the rise of open-weight foundation models. Meta's [LLaMA](/wiki/llama) (February 2023), Llama 2 (July 2023), and Llama 3 (April 2024), along with [Mistral](/wiki/mistral)'s open releases starting in late 2023, made it possible for smaller organizations to deploy and adapt foundation models without relying on commercial APIs.[8] Llama 3, released on 18 April 2024, was pretrained on more than 15 trillion tokens, more than seven times the data used for Llama 2.[20]

By 2025, the term had largely displaced "large language model" in policy and research contexts as the most common name for the underlying class of systems, while "LLM" remained the more common consumer-facing term. The EU AI Act, which entered into force in August 2024, regulates these systems under the legal label "general-purpose AI models" (GPAI), which is the European policy equivalent of the foundation model concept.[7]

## Properties

The Stanford report identified two properties that distinguish foundation models from earlier paradigms in machine learning: emergence and homogenization. As the report put it, "their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization."[1] Both are double-edged.

### Emergence

Emergence is the observation that quantitative increases in scale (parameters, data, compute) produce qualitatively new capabilities that are absent in smaller models and difficult to predict in advance. The canonical paper on emergence is Wei et al. 2022, "Emergent Abilities of Large Language Models" (arXiv:2206.07682), which defined the concept precisely: "We consider an ability to be emergent if it is not present in smaller models but is present in larger models."[9] The paper catalogued more than 100 tasks where performance was at chance level for small models and only became non-trivial above a certain scale threshold.[9]

Examples of emergent capabilities include multi-step arithmetic, instruction following without fine-tuning, [chain-of-thought](/wiki/chain_of_thought) reasoning, and the ability to learn new tasks from a few examples in the prompt (in-context learning). None of these were targeted training objectives; they appeared as a side effect of scaling next-token prediction on large corpora.

The emergence claim has been contested. A 2023 paper by Schaeffer, Miranda, and Koyejo ("Are Emergent Abilities of Large Language Models a Mirage?") argued that many apparent emergent abilities are artifacts of choosing discontinuous evaluation metrics, and that smooth metrics show continuous improvement with scale. The empirical question of which capabilities are genuinely emergent and which are measurement artifacts is still active.[10]
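The metric argument can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that per-token accuracy rises smoothly with scale and that token errors are independent; the accuracy values and the answer length are made up and are not taken from either paper. Under exact-match scoring of a multi-token answer, the same smooth improvement can look like a sudden jump.

```python
# Toy illustration of how a discontinuous metric can make smooth progress
# look like a sudden jump. All numbers are hypothetical.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of an answer is correct,
    assuming independent per-token errors (a simplifying assumption)."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for models of increasing scale.
per_token_accuracies = [0.50, 0.70, 0.90, 0.97, 0.99]
ANSWER_LENGTH = 10  # number of tokens in the target answer

for p in per_token_accuracies:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

In this toy setting, exact match stays near zero for the two smallest hypothetical models and then climbs steeply, even though per-token accuracy improves gradually throughout.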

### Homogenization

Homogenization is the observation that a small number of foundation models are increasingly used as the substrate for a vast number of downstream applications. In 2018, a typical NLP application was built by training a custom model on a custom dataset. In 2025, a typical NLP application is built by sending prompts to one of perhaps a dozen foundation models, sometimes with fine-tuning or retrieval but rarely with anything resembling from-scratch training.[1]

Homogenization has two consequences. First, improvements to the foundation model propagate to every downstream application that uses it. Second, defects in the foundation model also propagate to every downstream application. As the Stanford report warned, "Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream."[1] The same biases, factual errors, security vulnerabilities, or political tendencies appear in thousands of products built on the same base.

The Stanford report framed homogenization as a single-point-of-failure problem at civilizational scale. If a small number of organizations control the foundation models that underpin most AI applications, those organizations have outsized influence over downstream behavior, and any catastrophic failure (a security breach, a discovered bias, a regulatory shutdown) propagates broadly.[1]

## Examples by modality

Foundation models exist across many data types, and many of the most capable are [multimodal models](/wiki/multimodal_model) that handle several modalities at once. The following table lists notable foundation models grouped by primary modality.

| Modality | Model | Year | Developer | Notable feature |
|----------|-------|------|-----------|-----------------|
| Text | [GPT-3](/wiki/gpt-3) | 2020 | [OpenAI](/wiki/openai) | 175B params, demonstrated few-shot learning at scale |
| Text | [BERT](/wiki/bert) | 2018 | Google | First widely used bidirectional transformer foundation model |
| Text | [PaLM](/wiki/palm) | 2022 | Google | 540B params, strong reasoning and multilingual capability |
| Text | [LLaMA](/wiki/llama) | 2023 | [Meta AI](/wiki/meta_ai) | Open-weight model family (7B to 65B params) that was widely adopted by researchers |
| Text | [Mistral 7B](/wiki/mistral) | 2023 | [Mistral AI](/wiki/mistral_ai) | Open-weight model with strong performance per parameter |
| Text | BLOOM | 2022 | BigScience | 176B params, trained openly across 46 languages |
| Vision-language | [CLIP](/wiki/clip) | 2021 | [OpenAI](/wiki/openai) | Contrastive image-text pretraining, zero-shot classification |
| Vision-language | Flamingo | 2022 | [DeepMind](/wiki/deepmind) | 80B params, few-shot visual question answering |
| Vision-language | LLaVA | 2023 | UW-Madison / Microsoft | Open multimodal model built on LLaMA and CLIP |
| Image generation | [DALL-E](/wiki/dall-e) | 2021 | [OpenAI](/wiki/openai) | First widely known text-to-image foundation model |
| Image generation | [Stable Diffusion](/wiki/stable_diffusion) | 2022 | [Stability AI](/wiki/stability_ai) | Open-weight latent diffusion model |
| Image generation | Imagen | 2022 | Google | Photorealistic text-to-image generation |
| Audio | [Whisper](/wiki/whisper) | 2022 | [OpenAI](/wiki/openai) | Multilingual speech recognition trained on 680K hours of audio |
| Audio | MusicLM | 2023 | Google | Text-to-music generation |
| Video | [Sora](/wiki/sora) | 2024 | [OpenAI](/wiki/openai) | Text-to-video diffusion model with up to one-minute outputs |
| Video | Veo | 2024 | Google DeepMind | High-resolution video generation |
| Code | Codex | 2021 | [OpenAI](/wiki/openai) | Foundation for [GitHub Copilot](/wiki/github_copilot) |
| Code | StarCoder | 2023 | BigCode | Open-weight code generation model trained on permissively licensed code |
| Science | [AlphaFold 2](/wiki/alphafold) | 2021 | DeepMind | Protein structure prediction at near-experimental accuracy |
| Science | AlphaFold 3 | 2024 | DeepMind / Isomorphic Labs | Extends prediction to proteins with DNA, RNA, and ligands[14] |
| Science | RoseTTAFold | 2021 | Baker Lab | Open alternative for protein structure |
| Science | ESM-2 | 2022 | [Meta AI](/wiki/meta_ai) | Protein language model trained on hundreds of millions of sequences |
| Robotics | RT-2 | 2023 | Google DeepMind | Vision-language-action (VLA) model that transfers web knowledge to robot control |
| Robotics | OpenVLA | 2024 | Stanford and others | 7B open-weight VLA trained on the Open X-Embodiment dataset |
| Robotics | pi0 (pi-zero) | 2024 | Physical Intelligence | Generalist VLA flow model for dexterous robot control |
| Robotics | Gemini Robotics | 2025 | Google DeepMind | VLA built on Gemini 2.0 to act in the physical world |
| Science (weather) | GraphCast | 2023 | Google DeepMind | Medium-range global weather forecasting, faster than numerical models |
| Science (weather) | Aurora | 2024 | Microsoft | Large-scale foundation model of the atmosphere, fine-tuned to many tasks |
| Multimodal | [GPT-4](/wiki/gpt-4) | 2023 | [OpenAI](/wiki/openai) | Multimodal frontier model |
| Multimodal | [Claude](/wiki/claude) 3 | 2024 | [Anthropic](/wiki/anthropic) | Frontier model with vision input |
| Multimodal | [Gemini](/wiki/gemini) | 2023 | Google DeepMind | Built natively multimodal across text, images, audio, video |

This is a partial list. By 2025 there were several hundred publicly named foundation models in active use, and many more proprietary ones used inside individual organizations.

Two of the fastest-moving frontiers are science and robotics. In science, the foundation model recipe of broad pretraining plus task adaptation has spread well beyond protein folding: [AlphaFold 3](/wiki/alphafold) (2024) predicts how proteins interact with DNA, RNA, and small-molecule drugs, and a new class of atmospheric foundation models, including DeepMind's GraphCast and GenCast and Microsoft's Aurora, now matches or beats traditional numerical weather prediction while running orders of magnitude faster.[14][15] In robotics, vision-language-action (VLA) models extend the paradigm to physical control: a single model takes camera images and a natural-language instruction and outputs robot actions. RT-2 (2023) established the approach, and it has since been pushed forward by open models such as OpenVLA, generalist systems from startups like Physical Intelligence's pi0, and Google DeepMind's Gemini Robotics, which adapts a general multimodal foundation model to embodied tasks.[16]

## Training

Training a foundation model is dominated by three resources: data, compute, and engineering effort.

### Data

Text foundation models are pretrained on corpora ranging from hundreds of billions to many trillions of tokens. Common ingredients include the Common Crawl web scrape, English and multilingual Wikipedia, the Books3 corpus, scientific papers from arXiv and PubMed, code from GitHub, and curated dialogue and instruction data. The largest published text corpora as of 2025 contained on the order of 15 trillion tokens; Meta's Llama 3 (April 2024) was trained on more than 15 trillion tokens of publicly available data.[8][20]

Vision-language models are pretrained on image-text pairs scraped from the web. CLIP was trained on 400 million pairs; later models have used datasets in the billions. Quality, deduplication, and filtering matter as much as raw size; a smaller cleaner dataset often outperforms a larger noisier one for the same compute budget.[5]

Data provenance has become a contentious topic. Many text and image corpora include copyrighted works, personal data scraped without consent, and material whose authors did not anticipate AI training. Lawsuits over training data are pending in multiple jurisdictions, and several model developers have begun licensing data from publishers in part to reduce legal exposure.

### Compute

Foundation model training is compute-intensive. GPT-3 was estimated to require around 3,640 petaflop-days of compute in 2020. By 2024, frontier models were estimated to use more than 100 times that. Training the largest publicly disclosed foundation models in 2025 cost in the high tens to low hundreds of millions of dollars, mostly in GPU rental or amortized hardware costs.[2]
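To put the petaflop-day figure on the same scale as compute thresholds quoted in floating-point operations, the unit can be converted directly: one petaflop-day is 10^15 operations per second sustained for one day. The sketch below is simple arithmetic on the figures quoted above; the function and constant names are illustrative.

```python
# Convert petaflop/s-days into a total count of floating-point operations.
PETAFLOP_PER_SECOND = 1e15        # operations per second at one petaflop/s
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

def petaflop_days_to_flops(pf_days: float) -> float:
    """Total operations performed at 1 petaflop/s for `pf_days` days."""
    return pf_days * PETAFLOP_PER_SECOND * SECONDS_PER_DAY

gpt3_flops = petaflop_days_to_flops(3640)  # GPT-3 estimate quoted above
print(f"GPT-3 estimate:      {gpt3_flops:.2e} FLOPs")
print(f"100x that estimate:  {100 * gpt3_flops:.2e} FLOPs")
```

On these numbers, GPT-3's training run comes to roughly 3 x 10^23 operations, and a run 100 times larger lands above 10^25.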

The compute used to train notable AI models has roughly doubled every six months since 2010, far outpacing Moore's law, and for frontier models specifically it has grown about 5 times per year since 2020.[19] This trend has driven enormous capital investment into AI infrastructure: Nvidia GPUs, custom accelerators (TPUs at Google, Trainium at Amazon, MTIA at Meta), and the construction of dedicated data centers with multi-gigawatt power requirements.
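Doubling times and annual growth factors describe the same trend in different units, and converting between them makes the two figures above easier to compare. The sketch below is plain arithmetic; the two-year doubling period used for Moore's law is the commonly quoted figure and is an assumption here, not a number from the cited source.

```python
import math

def annual_growth_factor(doubling_time_months: float) -> float:
    """Growth per year implied by a given doubling time."""
    return 2 ** (12 / doubling_time_months)

def doubling_time_months(annual_factor: float) -> float:
    """Doubling time in months implied by a given annual growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)

# Doubling every six months (the trend for notable models since 2010).
print(f"6-month doubling  -> {annual_growth_factor(6):.2f}x per year")
# About 5x per year (the frontier-model trend since 2020).
print(f"5x per year       -> doubling every {doubling_time_months(5):.1f} months")
# Moore's law, assuming a two-year doubling period.
print(f"24-month doubling -> {annual_growth_factor(24):.2f}x per year")
```

A six-month doubling time works out to 4x per year, and 5x per year corresponds to a doubling time of a little over five months, so the two trends are close to each other and both far steeper than a two-year doubling.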

### Self-supervision

The defining training technique for foundation models is self-supervision: the model learns by predicting parts of its input from other parts, with no need for human-provided labels. Specific objectives vary by modality:

| Modality | Common objective | Example models |
|----------|------------------|----------------|
| Text | Next-token prediction (autoregressive) | GPT, LLaMA, Claude, Gemini |
| Text | Masked-token prediction | BERT, RoBERTa |
| Vision-text | Contrastive matching of image-text pairs | CLIP, ALIGN |
| Image | Denoising diffusion | Stable Diffusion, DALL-E 3, Imagen |
| Image | Masked-patch prediction | MAE, BEiT |
| Audio | Masked acoustic prediction; weak supervision | wav2vec 2.0, Whisper |
| Protein | Masked-amino-acid prediction | ESM-2 |

Self-supervision is what unlocks scale. Because no human labeling is required, training data can be collected cheaply at web scale, and the model learns general representations that transfer to many downstream tasks.
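The way labels fall out of the raw data can be sketched for the two text objectives in the table. The example below is a minimal illustration, not any model's actual data pipeline: it splits on whitespace instead of using a real tokenizer, and the function names, the `[MASK]` placeholder, and the default masking rate are assumptions chosen for readability.

```python
import random

def next_token_pairs(tokens):
    """Autoregressive objective: each token is the target for the
    tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_token_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Masked objective: hide a random subset of tokens and keep the
    hidden originals as prediction targets, indexed by position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

# Whitespace splitting stands in for a real tokenizer.
tokens = "the model predicts parts of its input".split()

for context, target in next_token_pairs(tokens)[:3]:
    print(context, "->", target)

masked, targets = masked_token_example(tokens, mask_prob=0.3)
print(masked, targets)
```

In both cases the training targets are taken from the text itself, which is why no human annotation is needed.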

## Adaptation methods

A pretrained foundation model is rarely deployed as-is. Almost every production use involves some form of adaptation to a specific task, domain, or behavioral profile.

| Method | What it changes | Cost | Typical use |
|--------|-----------------|------|-------------|
| Full [fine-tuning](/wiki/fine-tuning) | All model weights | High (full training) | Domain adaptation, behavior tuning |
| [Instruction tuning](/wiki/instruction_tuning) | All weights, on instruction-response pairs | Moderate | Making a base model follow instructions |
| [RLHF](/wiki/reinforcement_learning_from_human_feedback) | Weights, via human-preference reward model | High | Aligning outputs with user intent |
| [DPO](/wiki/direct_preference_optimization_dpo) | Weights, directly on preference data | Moderate | Lighter-weight alternative to RLHF |
| [LoRA](/wiki/lora) and adapters | Small low-rank deltas added to weights | Low | Parameter-efficient fine-tuning |
| Prefix and prompt tuning | A small set of soft tokens prepended to input | Very low | Lightweight task adaptation |
| In-context learning | Nothing (only the prompt) | None | Few-shot tasks at inference time |
| [Prompt engineering](/wiki/prompt_engineering) | Nothing (only the prompt phrasing) | None | Steering frozen models without retraining |
| [Retrieval-augmented generation](/wiki/retrieval_augmented_generation) | Adds a retrieval step before generation | Moderate | Grounding outputs in external knowledge |
| Tool use and function calling | Adds external API calls during generation | Moderate | Giving models access to calculators, search, code execution |

Full fine-tuning by third parties is rarely possible for the largest commercial foundation models because their weights are not released. Most adaptation today is parameter-efficient (LoRA, adapters, soft prompts) or training-free (prompts, retrieval, tools). Open-weight models like LLaMA and Mistral support the full range of methods.
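The cost gap between full fine-tuning and LoRA in the table above comes from how few numbers a low-rank update needs. The sketch below assumes the usual LoRA formulation, in which a frozen weight matrix W of shape d_out x d_in is adjusted by the product of two small trainable matrices B (d_out x r) and A (r x d_in). The layer size, rank, and `scale` parameter are hypothetical, and the pure-Python matrix code is for illustration only.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A."""
    return rank * (d_out + d_in)

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def apply_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W itself stays frozen during training."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical layer: 4096 x 4096 weights, rank-8 update.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

# Tiny worked example of the update itself.
print(apply_lora([[1, 0], [0, 1]], [[1], [2]], [[3, 4]]))
```

For this hypothetical layer the low-rank update trains about 0.4% as many parameters as full fine-tuning, which is why adapters for many tasks can be stored alongside a single shared base model.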

### In-context learning

A distinctive property of large foundation models is in-context learning: the model learns a new task at inference time, from a few examples included in the prompt, with no weight updates. GPT-3 was the first model to make this property widely visible, and it has since been characterized in many follow-up papers. The mechanism is still imperfectly understood; it appears to depend on scale, training data composition, and the specific structure of the task.[5]
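In practice, in-context learning amounts to packing labeled demonstrations and an unlabeled query into a single prompt. The sketch below shows one way to assemble such a prompt; the `Input:`/`Output:` layout, the function name, and the sentiment examples are illustrative assumptions, and no particular model or API is implied.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, labeled demonstrations,
    and a final unlabeled query for the model to complete."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical sentiment task with two demonstrations.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Setup took five minutes and everything worked.",
)
print(prompt)
```

The model's weights are unchanged throughout; the task is specified entirely by the text of the prompt.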

## Concerns

The Stanford report devoted substantial space to risks. Four families of concerns have remained central.

### Homogenization and single points of failure

Because many downstream applications are built on a small number of foundation models, those few models become critical infrastructure. Defects propagate. If GPT-4 has a particular factual error or political slant or security flaw, every product built on GPT-4 inherits that property. If a frontier model is suddenly withdrawn (regulatory action, vendor decision, security incident), every dependent application is affected. The pre-foundation-model era did not have this single-point property to anything like the same degree.[1]

### Bias amplification

Foundation models trained on web-scraped data inherit the biases of that data. Studies have documented gender, racial, religious, and political biases in outputs across language models, image generators, and multimodal systems. When a single foundation model is used in thousands of downstream applications, its biases are not just reproduced but amplified through scale.[1] Mitigation work includes data filtering, RLHF on bias-related preferences, fine-tuning on counter-stereotyped data, and inference-time filters, none of which fully solve the problem.

### Environmental impact

Training and serving foundation models consume substantial energy. Estimates of the carbon footprint of training a single large model in 2020 ranged from tens to hundreds of tonnes of CO2 equivalent (Strubell et al. 2019, Patterson et al. 2021). Serving costs at deployment scale now dominate training costs for popular models, and global data center electricity use has risen sharply with the foundation model boom. Water use for cooling and the mining of rare earths for hardware add further environmental impacts.[1]

### Misuse and dual-use risk

Foundation models lower the cost of producing convincing text, images, audio, and video. This has implications for disinformation campaigns, fraud, non-consensual intimate imagery, and the proliferation of malware. The largest models also raise concerns about uplift in the production of chemical, biological, radiological, and nuclear (CBRN) weapons, and about offensive cyber capability. Governments have introduced reporting requirements, evaluation regimes, and pre-deployment safety testing for the most capable systems in part to address these concerns.[7][11]

## How are foundation models evaluated?

Evaluation of foundation models is a research field of its own. One of the most prominent benchmark suites is HELM (Holistic Evaluation of Language Models), introduced by Liang, Bommasani, and colleagues at Stanford CRFM in November 2022 (arXiv:2211.09110). HELM was designed to address the previously fragmented state of language model evaluation, where different models were tested on different benchmarks under different conditions, making fair comparison impossible.[12]

HELM evaluates models along seven dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The original release covered 30 prominent language models on 42 scenarios, raising the share of core scenarios with comparable evaluations across all models from 17.9% to 96.0%.[12] HELM is maintained as a living benchmark with periodic releases that add new models, scenarios, and evaluation dimensions.[12]

| HELM dimension | What it measures |
|----------------|------------------|
| Accuracy | Standard task performance |
| Calibration | Whether the model's confidence matches its correctness |
| Robustness | Stability under perturbations to the input |
| Fairness | Disparities in performance across demographic groups |
| Bias | Skew in generated content along social dimensions |
| Toxicity | Production of harmful or offensive output |
| Efficiency | Compute and latency required for inference |
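Calibration is the least intuitive of these dimensions, so a small example may help. One common way to quantify it is expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and observed accuracy is averaged across bins, weighted by bin size. The sketch below is a generic illustration of that idea and is not HELM's implementation; the bin count and the sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over
    equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that is 90% confident and right 9 times out of 10.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Hypothetical model that is fully confident but right 1 time out of 4.
overconfident = expected_calibration_error([1.0] * 4, [True, False, False, False])
print(f"well calibrated: {well_calibrated:.3f}  overconfident: {overconfident:.3f}")
```

A score near zero means stated confidence tracks actual correctness; larger values indicate over- or under-confidence.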

Other notable evaluation frameworks include MMLU (Hendrycks et al. 2020) for academic knowledge, BIG-bench (a community-contributed benchmark with more than 200 tasks), HumanEval and MBPP for code, MT-Bench and Chatbot Arena for instruction-following and dialogue, and various agent-oriented benchmarks like AgentBench and SWE-bench.

## How are foundation models regulated?

Foundation models have become a central object of AI regulation. Several jurisdictions have introduced bespoke rules for them.

### European Union: the AI Act

The EU AI Act, adopted in 2024 and entering into force in August 2024 with phased applicability through 2027, regulates foundation models under the term "general-purpose AI model" (GPAI). A GPAI model is defined roughly as a model trained on broad data using self-supervision at scale that displays significant generality and can be integrated into a wide variety of downstream systems.[7]

All GPAI models are subject to baseline transparency obligations: technical documentation, summaries of training data, and copyright compliance procedures. A subset of GPAI models is classified as having "systemic risk," with substantially heavier obligations including model evaluations, adversarial testing, incident reporting, and cybersecurity protections. As of 2025, a model was presumed to fall into this class when its cumulative training compute exceeded 10^25 floating-point operations, a level that GPT-4o, Gemini Ultra, Claude 3 Opus, Mistral Large 2, and a small number of other models were estimated to have reached. The systemic-risk obligations entered into application on 2 August 2025.[7]
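Because developers rarely publish exact compute figures, outside estimates of where a model sits relative to the 10^25 threshold often rely on a rule of thumb from the scaling-law literature: training compute is roughly 6 x N x D floating-point operations for a dense transformer with N parameters trained on D tokens. The sketch below applies that approximation. It is not part of the AI Act and is not how the AI Office measures compute; the second model is hypothetical, and the GPT-3 token count is the approximately 300 billion tokens reported in the GPT-3 paper.

```python
SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25  # threshold quoted above

def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 operations per parameter per token."""
    return 6 * n_params * n_tokens

def exceeds_threshold(n_params: float, n_tokens: float) -> bool:
    """True if the estimated compute is above the systemic-risk threshold."""
    return estimate_training_flops(n_params, n_tokens) > SYSTEMIC_RISK_THRESHOLD_FLOPS

models = {
    "GPT-3 (175B params, ~300B tokens)": (175e9, 300e9),
    "hypothetical 400B params, 15T tokens": (400e9, 15e12),
}
for name, (n_params, n_tokens) in models.items():
    flops = estimate_training_flops(n_params, n_tokens)
    print(f"{name}: {flops:.2e} FLOPs, above threshold: "
          f"{exceeds_threshold(n_params, n_tokens)}")
```

On this estimate GPT-3 falls well below the threshold, while a hypothetical model of a few hundred billion parameters trained on around 15 trillion tokens would exceed it.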

The European Commission's AI Office is responsible for supervision of GPAI models. A General-Purpose AI Code of Practice, the final version of which was published in July 2025, provides a voluntary structured way for providers to demonstrate compliance with the transparency, copyright, and safety-and-security obligations until formal harmonized standards are published. The core GPAI obligations began to apply on 2 August 2025.[7]

### United States: Executive Order 14110

In the United States, President Biden's Executive Order 14110 of 30 October 2023 introduced the term "dual-use foundation model" as a regulatory category. The definition borrows directly from the Stanford report: a model trained on broad data, generally using self-supervision, containing at least tens of billions of parameters, applicable across a wide range of contexts, and exhibiting (or easily modified to exhibit) high levels of performance at tasks that pose serious risks to security, national economic security, or public health and safety.[11]

The order required developers of dual-use foundation models above specified compute thresholds to report training plans, safety test results, and information about cybersecurity protections to the U.S. government. It also tasked NIST with developing evaluation guidelines and directed the National Science Foundation to launch the National AI Research Resource pilot. Executive Order 14110 was rescinded by President Trump on 20 January 2025 as part of a broader rollback of Biden-era AI regulation. Some of its underlying mechanisms (NIST work, voluntary commitments from developers) have continued in modified form; the formal reporting requirements have lapsed.

### Other jurisdictions

The United Kingdom hosted the AI Safety Summit at Bletchley Park in November 2023, which produced a declaration on the risks of frontier AI signed by 28 countries and the European Union. The Cyberspace Administration of China introduced rules on generative AI in 2023 that include pre-deployment safety assessments for generative AI services offered to the public. Canada, South Korea, Japan, Singapore, Brazil, and other countries have developed or are developing comparable frameworks.

## Modern landscape

As of early 2026, a small number of organizations dominate frontier foundation model development. The 2025 release wave brought OpenAI's GPT-5 (August 2025), Meta's open-weight Llama 4 (April 2025), Google DeepMind's Gemini 3 (November 2025), and Anthropic's Claude Opus 4.5 (November 2025), among others.[17][18]

| Developer | Flagship foundation model (2025-2026) | Modality | Release approach |
|-----------|---------------------------------------|----------|---------------|
| [OpenAI](/wiki/openai) | GPT-5 series, o-series reasoning models | Multimodal | API and product |
| [Anthropic](/wiki/anthropic) | [Claude](/wiki/claude) Opus 4.5 and the Claude 4 family | Multimodal | API and product |
| Google DeepMind | [Gemini](/wiki/gemini) 3 family | Multimodal | API and product |
| [Meta AI](/wiki/meta_ai) | [Llama](/wiki/llama) 4 family | Multimodal | Open weights |
| [Mistral AI](/wiki/mistral_ai) | Mistral Large, Codestral, Pixtral | Multimodal | Mix of open and commercial |
| xAI | Grok | Multimodal | API and product |
| [DeepSeek](/wiki/deepseek) | DeepSeek V3, R1 | Text | Open weights |
| Alibaba | Qwen series | Multimodal | Open weights |
| Baidu | Ernie series | Multimodal | API and product |
| [Stability AI](/wiki/stability_ai) | Stable Diffusion 3, Stable Video Diffusion | Image and video | Open weights |
| [Black Forest Labs](/wiki/black_forest_labs) | FLUX | Image | Mix of open and commercial |
| Cohere | Command series | Text | API |
| AI21 Labs | Jamba | Text | Open weights and API |

### Are foundation models open source?

The industry has split, by rough convention, into closed-weight providers (OpenAI, Anthropic, most of Google) and open-weight providers (Meta, Mistral, DeepSeek, many Chinese labs, several smaller Western labs). Both sides have produced state-of-the-art models. Note that "open weights" is not the same as "open source": most open-weight releases publish the trained parameters under a license but withhold the training data, full training code, and methodology. The gap between the best open-weight and best closed-weight models has narrowed considerably between 2023 and 2026.

The foundation model market has also concentrated at the infrastructure layer. Nvidia supplies the overwhelming majority of training accelerators. AWS, Microsoft Azure, and Google Cloud provide the hosting. Scale AI, Surge AI, and a handful of other firms supply the human-labeled data used for instruction tuning and RLHF. This concentration is itself a source of regulatory and competitive concern.[1]

## Transparency

The Foundation Model Transparency Index (FMTI), maintained by Stanford CRFM, scores major foundation model developers on 100 indicators across data, compute, model characteristics, and downstream impact. The first edition, published in October 2023, found a top score of 54 out of 100 and an average of around 37: Meta's Llama 2 scored highest at 54, followed by OpenAI's GPT-4 at 48 and Google's PaLM 2 at 40.[13] The 2024 update reported some improvement across 14 developers but persistent gaps in data transparency, labor practices, and downstream usage reporting.[13]
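
The arithmetic behind an indicator-based index of this kind can be sketched as follows. The sketch assumes each indicator is scored as simply satisfied or not, and that a developer's total is the count of satisfied indicators. The four indicator names, the function names, and the example developer are hypothetical placeholders, not the index's actual indicators or data.

```python
from collections import Counter

# Hypothetical indicators mapped to the four areas named in the text.
# The real index uses 100 indicators; these four are placeholders.
INDICATORS = {
    "training_data_sources_disclosed": "data",
    "training_compute_disclosed": "compute",
    "evaluation_results_published": "model characteristics",
    "usage_statistics_reported": "downstream impact",
}


def total_score(satisfied: set[str]) -> int:
    """Score = number of known indicators the developer satisfies."""
    return sum(1 for name in INDICATORS if name in satisfied)


def score_by_area(satisfied: set[str]) -> dict[str, int]:
    """Break the total down by area, including areas that score zero."""
    counts = Counter(INDICATORS[name] for name in satisfied if name in INDICATORS)
    areas = dict.fromkeys(INDICATORS.values())  # unique areas, original order
    return {area: counts.get(area, 0) for area in areas}


# Hypothetical developer that discloses compute and evaluation results only.
developer = {"training_compute_disclosed", "evaluation_results_published"}
print(total_score(developer), "of", len(INDICATORS))
print(score_by_area(developer))
```

With 100 indicators instead of four, the same count yields a score out of 100, which is how figures such as 54 or 48 can be read.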

Transparency disputes have been a recurring source of friction. Developers of closed-weight models typically publish little detail about training data, training compute, training methodology, or evaluation results beyond marketing-friendly headline numbers. Open-weight model developers typically publish more, though with significant variation. The push for greater transparency has come from researchers (who need methodological detail to study the systems), regulators (who need information to enforce rules), and downstream developers (who need to understand the models they are building on).

## Criticism of the term

The term "foundation model" has been criticized on several grounds.

Linguist Emily Bender and others have argued that the term is misleading because it suggests a stable, well-understood substrate when in fact these models are opaque, brittle, and behave inconsistently. Calling them "foundations" implies they can safely support load, which has not been demonstrated.[6]

Some researchers have argued that the term is essentially a rebrand of "large language model" designed to give Stanford intellectual ownership of a phenomenon that already had a name. The fact that most discourse about "foundation models" focuses on language models (rather than CLIP, AlphaFold, or Whisper) lends some weight to this critique.

From a different direction, researchers such as Yann LeCun have argued that current foundation models are not foundational in any deep sense, because they lack world models, persistent memory, planning, and other capabilities he considers necessary for real intelligence. On this view, treating LLM-style foundation models as the basis for general AI is a category error.

Despite these objections, the term has become entrenched in research, industry, and policy. The concept now appears in the European Union's AI Act (under the related term "general-purpose AI model"), in the (rescinded) U.S. Executive Order on AI (as "dual-use foundation model"), in NIST glossaries, and in most academic literature published since 2022.

## See also

- [Large language model](/wiki/large_language_model)
- [Multimodal model](/wiki/multimodal_model)
- [Generative artificial intelligence](/wiki/generative_ai)
- [Self-supervised learning](/wiki/self-supervised_learning)
- [Transfer learning](/wiki/transfer_learning)
- [Transformer](/wiki/transformer)
- [Diffusion model](/wiki/diffusion_model)
- [Fine-tuning](/wiki/fine-tuning)
- [Reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback)
- [LoRA](/wiki/lora)
- [Prompt engineering](/wiki/prompt_engineering)
- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [Emergent abilities](/wiki/emergent_abilities)
- [Frontier models](/wiki/frontier_models)
- [AI alignment](/wiki/ai_alignment)
- [EU AI Act](/wiki/eu_ai_act)

## References

1. Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). "On the Opportunities and Risks of Foundation Models." arXiv:2108.07258 (114 authors). Stanford Center for Research on Foundation Models. https://arxiv.org/abs/2108.07258. Accessed 2026-06-19.
2. Cottier, B. et al. (2024). "The rising costs of training frontier AI models." Epoch AI report.
3. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30.
4. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805.
5. Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33. arXiv:2005.14165.
6. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" *Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency*.
7. European Union (2024). "Regulation (EU) 2024/1689 (Artificial Intelligence Act)," Articles 51-55; European Commission, General-Purpose AI Code of Practice (final, July 2025). https://artificialintelligenceact.eu/. Accessed 2026-06-19.
8. Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Meta AI.
9. Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models." arXiv:2206.07682. *Transactions on Machine Learning Research*.
10. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" arXiv:2304.15004. *NeurIPS 2023*.
11. The White House (2023). "Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," 30 October 2023. Section 3(k).
12. Liang, P., Bommasani, R., Lee, T., et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/helm/. Accessed 2026-06-19.
13. Bommasani, R., Klyman, K., Longpre, S., et al. (2023, 2024). "The Foundation Model Transparency Index." Stanford CRFM. arXiv:2310.12941. https://crfm.stanford.edu/fmti/. Accessed 2026-06-19.
14. Abramson, J., Adler, J., Dunger, J., et al. (2024). "Accurate structure prediction of biomolecular interactions with AlphaFold 3." *Nature*, 630, 493-500. https://www.nature.com/articles/s41586-024-07487-w. Accessed 2026-05-31.
15. Bodnar, C., Bruinsma, W. P., Lucic, A., et al. (2025). "A foundation model for the Earth system (Aurora)." *Nature*. https://www.microsoft.com/en-us/research/blog/introducing-aurora-the-first-large-scale-foundation-model-of-the-atmosphere/. Accessed 2026-05-31.
16. Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818. https://arxiv.org/abs/2307.15818. Accessed 2026-05-31.
17. Meta AI (2025). "The Llama 4 herd." https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed 2026-05-31.
18. Anthropic (2025). "Introducing Claude Opus 4.5." https://www.anthropic.com/news/claude-opus-4-5. Accessed 2026-05-31.
19. Epoch AI (2024). "The training compute of notable AI models has been doubling roughly every six months." https://epoch.ai/data-insights/compute-trend-post-2010. Accessed 2026-06-19.
20. Meta AI (2024). "Introducing Meta Llama 3: The most capable openly available LLM to date," 18 April 2024. https://ai.meta.com/blog/meta-llama-3/. Accessed 2026-06-19.

## External links

- [Stanford Center for Research on Foundation Models](https://crfm.stanford.edu/)
- [On the Opportunities and Risks of Foundation Models (full report)](https://crfm.stanford.edu/report.html)
- [HELM benchmark](https://crfm.stanford.edu/helm/latest/)
- [Foundation Model Transparency Index](https://crfm.stanford.edu/fmti/)
- [EU AI Act, Article 51 (systemic-risk classification)](https://artificialintelligenceact.eu/article/51/)
- [Hugging Face model hub](https://huggingface.co/)