Foundation models
See also: Large language model, Self-supervised learning, Transfer learning, Generative artificial intelligence
A foundation model is a large machine learning model trained on broad data, generally using self-supervised learning at scale, that can be adapted to a wide range of downstream tasks.[1] The term was introduced in August 2021 by researchers at Stanford's Center for Research on Foundation Models (CRFM) to describe a class of models, including BERT, GPT-3, CLIP, DALL-E, and AlphaFold, that play a foundational role in many downstream applications. The defining technical property is that a single pretrained model can serve many tasks after light adaptation, rather than each task requiring a separately trained model from scratch.[1]
Foundation models are now central to almost every major AI system in production. GPT-4, Claude, Gemini, LLaMA, Mistral, Stable Diffusion, Whisper, and Sora are all foundation models in this sense. They are pretrained once at very high cost (often hundreds of millions of dollars in compute) and then deployed across countless products, from chatbots and search engines to medical imaging and protein design.[2]
The term itself was deliberately chosen by Stanford CRFM as both a description and an argument: foundation models are the foundation on which much of modern AI is now built, and the concentration of so much capability into a small number of pretrained checkpoints has consequences that go beyond any single application. The original 212-page Stanford report, written by more than 100 authors and led by Percy Liang and Rishi Bommasani, argued that this paradigm shift introduces both new opportunities (broad capability, transfer learning, emergence) and new risks (homogenization, single points of failure, bias amplification, environmental impact, regulatory blind spots).[1]
The most widely cited definition comes from the original Stanford report: "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks."[1] The definition has three load-bearing parts: training on broad data rather than a narrow task-specific dataset, self-supervision at scale rather than reliance on human labels, and adaptability to a wide range of downstream tasks rather than a single intended use.
The Stanford authors explicitly distinguished the term from earlier labels they considered too narrow. "Large language model" was rejected because foundation models are not only about language; they include vision, speech, code, and protein structure. "Self-supervised model" was considered too tied to a particular training objective. "Pretrained model" was rejected because the interesting behaviors emerge after pretraining, not during it. They also chose "foundation" over "foundational" to avoid implying that these models embody fundamental scientific principles, when in fact they are more like load-bearing infrastructure.[1]
Not every large neural network is a foundation model. A model trained only for image classification on ImageNet, even if it has hundreds of millions of parameters, is not a foundation model in the Stanford sense because it was trained on a narrow dataset for a single task. The defining feature is breadth, both in training data and in downstream applicability.
Foundation models and large language models overlap heavily but are not the same. All LLMs that are trained on broad data and then adapted to many downstream tasks (which is almost all of them today) are foundation models, but not all foundation models are LLMs. CLIP is a foundation model that processes images and text. AlphaFold is a foundation model for protein structures. Whisper is an audio foundation model. Foundation models are the broader category; LLMs are the language-only subset that has dominated public attention since ChatGPT.
The technical ingredients of foundation models predate the term by several years. Self-supervised pretraining followed by task-specific fine-tuning was already common in NLP by 2018, with Word2vec (2013) and GloVe (2014) representing earlier static-embedding versions of the same idea. The Transformer architecture introduced by Vaswani et al. in 2017 made it practical to scale these methods, and 2018 saw the release of OpenAI's GPT and Google's BERT, both of which pretrained large transformer models on broad text corpora and adapted them to dozens of downstream tasks.[3][4]
What changed in 2020 was scale. OpenAI's GPT-3, with 175 billion parameters, demonstrated that sufficiently large pretrained models could perform many tasks zero-shot or with only a handful of examples in the prompt, with no fine-tuning at all. Late that year, DeepMind's AlphaFold 2 showed that the paradigm worked for scientific problems like protein structure prediction, and in early 2021 CLIP and DALL-E showed that it extended across modalities.[5]
In August 2021, more than 100 researchers from Stanford's Center for Research on Foundation Models published a 212-page report titled "On the Opportunities and Risks of Foundation Models" (arXiv:2108.07258). The lead authors were Rishi Bommasani and Drew Hudson; Percy Liang directed CRFM. The author list spanned computer science, linguistics, law, education, philosophy, medicine, and the social sciences.[1]
The report did three things at once. It named a phenomenon that researchers had been talking around ("large pretrained models") with a single term. It catalogued what those models could do across modalities, tasks, and disciplines. And it raised a long list of concerns about the social, economic, and scientific consequences of building so much downstream capability on top of a small number of opaque pretrained checkpoints.[1]
The choice of name was immediately controversial. Some researchers felt that "foundation model" was a rebrand of "large language model" intended to give Stanford a flag to plant in the ground. Emily Bender, a linguist at the University of Washington and a vocal critic of large language models, argued that the term implied a stability and reliability the underlying models had not earned.[6] Others objected that the term obscured the fact that these systems were being deployed before their failure modes were understood. Despite the criticism, the term stuck. Within two years it was used routinely in academic papers, industry product launches, and government policy documents.
The November 2022 release of ChatGPT, built on OpenAI's GPT-3.5 foundation model, brought foundation models into mainstream public awareness. The August 2022 release of Stable Diffusion did the same for image generation. By 2023, foundation models were the explicit subject of regulatory attention from the European Union, the United States, the United Kingdom, China, and other jurisdictions.[7]
The 2023 to 2024 period also saw the rise of open-weight foundation models. Meta's LLaMA (February 2023), Llama 2 (July 2023), and Llama 3 (April 2024), along with Mistral's open releases starting in late 2023, made it possible for smaller organizations to deploy and adapt foundation models without relying on commercial APIs.[8]
By 2025, the term had largely displaced "large language model" in policy and research contexts as the most common name for the underlying class of systems, while "LLM" remained the more common consumer-facing term. The EU AI Act, which entered into force in August 2024, regulates these systems under the legal label "general-purpose AI models" (GPAI), which is the European policy equivalent of the foundation model concept.[7]
The Stanford report identified two properties that distinguish foundation models from earlier paradigms in machine learning: emergence and homogenization. Both are double-edged.
Emergence is the observation that quantitative increases in scale (parameters, data, compute) produce qualitatively new capabilities that are absent in smaller models and difficult to predict in advance. The canonical paper on emergence is Wei et al. 2022, "Emergent Abilities of Large Language Models" (arXiv:2206.07682), which catalogued more than 100 tasks where performance was at chance level for small models and only became non-trivial above a certain scale threshold.[9]
Examples of emergent capabilities include multi-step arithmetic, instruction following without fine-tuning, chain-of-thought reasoning, and the ability to learn new tasks from a few examples in the prompt (in-context learning). None of these were targeted training objectives; they appeared as a side effect of scaling next-token prediction on large corpora.
The emergence claim has been contested. A 2023 paper by Schaeffer, Miranda, and Koyejo ("Are Emergent Abilities of Large Language Models a Mirage?") argued that many apparent emergent abilities are artifacts of choosing discontinuous evaluation metrics, and that smooth metrics show continuous improvement with scale. The empirical question of which capabilities are genuinely emergent and which are measurement artifacts is still active.[10]
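The shape of this debate can be reproduced with a toy simulation (an illustrative sketch of the metric-choice argument, not code from either paper): if per-token accuracy improves smoothly with scale, an all-or-nothing metric such as exact match on a multi-token answer still looks like a sudden jump.

```python
import numpy as np

# Toy illustration of the metric-choice argument: per-token accuracy
# improves smoothly with scale, but an all-or-nothing metric does not.
scales = np.logspace(8, 12, 9)                            # hypothetical parameter counts
per_token_acc = 1 - 1 / (1 + np.sqrt(scales / 1e10))      # smooth, made-up scaling curve

answer_length = 10                                        # tokens that must all be correct
exact_match = per_token_acc ** answer_length              # looks like a sharp threshold

for n, p, e in zip(scales, per_token_acc, exact_match):
    print(f"{n:.0e} params   per-token={p:.2f}   exact-match={e:.4f}")
```

Under the smooth metric the curve rises gradually; under exact match it sits near zero and then climbs abruptly, even though nothing discontinuous happened to the underlying model.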
Homogenization is the observation that a small number of foundation models are increasingly used as the substrate for a vast number of downstream applications. In 2018, a typical NLP application was built by training a custom model on a custom dataset. In 2025, a typical NLP application is built by sending prompts to one of perhaps a dozen foundation models, sometimes with fine-tuning or retrieval but rarely with anything resembling from-scratch training.[1]
Homogenization has two consequences. First, improvements to the foundation model propagate to every downstream application that uses it. Second, defects in the foundation model also propagate to every downstream application. The same biases, factual errors, security vulnerabilities, or political tendencies appear in thousands of products built on the same base.
The Stanford report framed homogenization as a single point of failure problem at civilizational scale. If a small number of organizations control the foundation models that underpin most AI applications, those organizations have outsized influence over downstream behavior, and any catastrophic failure (a security breach, a discovered bias, a regulatory shutdown) propagates broadly.[1]
Foundation models exist across many data types. The following table lists notable foundation models grouped by primary modality.
| Modality | Model | Year | Developer | Notable feature |
|---|---|---|---|---|
| Text | GPT-3 | 2020 | OpenAI | 175B params, demonstrated few-shot learning at scale |
| Text | BERT | 2018 | Google | First widely used bidirectional transformer foundation model |
| Text | PaLM | 2022 | Google | 540B params, strong reasoning and multilingual capability |
| Text | LLaMA | 2023 | Meta AI | First widely released open-weight foundation model at scale |
| Text | Mistral 7B | 2023 | Mistral AI | Open-weight model with strong performance per parameter |
| Text | BLOOM | 2022 | BigScience | 176B params, trained openly across 46 languages |
| Vision-language | CLIP | 2021 | OpenAI | Contrastive image-text pretraining, zero-shot classification |
| Vision-language | Flamingo | 2022 | DeepMind | 80B params, few-shot visual question answering |
| Vision-language | LLaVA | 2023 | UW-Madison / Microsoft | Open multimodal model built on LLaMA and CLIP |
| Image generation | DALL-E | 2021 | OpenAI | First widely known text-to-image foundation model |
| Image generation | Stable Diffusion | 2022 | Stability AI | Open-weight latent diffusion model |
| Image generation | Imagen | 2022 | Google | Photorealistic text-to-image generation |
| Audio | Whisper | 2022 | OpenAI | Multilingual speech recognition trained on 680K hours of audio |
| Audio | MusicLM | 2023 | Google | Text-to-music generation |
| Video | Sora | 2024 | OpenAI | Text-to-video diffusion model with up to one-minute outputs |
| Video | Veo | 2024 | Google DeepMind | High-resolution video generation |
| Code | Codex | 2021 | OpenAI | Foundation for GitHub Copilot |
| Code | StarCoder | 2023 | BigCode | Open-weight code generation model trained on permissively licensed code |
| Science | AlphaFold 2 | 2021 | DeepMind | Protein structure prediction at near-experimental accuracy |
| Science | RoseTTAFold | 2021 | Baker Lab | Open alternative for protein structure |
| Science | ESM-2 | 2022 | Meta AI | Protein language model trained on hundreds of millions of sequences |
| Robotics | RT-2 | 2023 | Google DeepMind | Vision-language-action model for robot control |
| Multimodal | GPT-4 | 2023 | OpenAI | Multimodal frontier model |
| Multimodal | Claude 3 | 2024 | Anthropic | Frontier model with vision input |
| Multimodal | Gemini | 2023 | Google DeepMind | Natively multimodal across text, images, audio, and video |
This is a partial list. By 2025 there were several hundred publicly named foundation models in active use, and many more proprietary ones used inside individual organizations.
Training a foundation model is dominated by three resources: data, compute, and engineering effort.
Text foundation models are pretrained on corpora ranging from hundreds of billions to many trillions of tokens. Common ingredients include the Common Crawl web scrape, English and multilingual Wikipedia, the Books3 corpus, scientific papers from arXiv and PubMed, code from GitHub, and curated dialogue and instruction data. The largest published text corpora as of 2025 contained on the order of 15 trillion tokens.[8]
Vision-language models are pretrained on image-text pairs scraped from the web. CLIP was trained on 400 million pairs; later models have used datasets in the billions. Quality, deduplication, and filtering matter as much as raw size; a smaller cleaner dataset often outperforms a larger noisier one for the same compute budget.[5]
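As a simplified illustration of the deduplication step (an exact-match sketch over a hypothetical corpus; production pipelines typically add fuzzy near-duplicate detection such as MinHash):

```python
import hashlib

def normalize(text: str) -> str:
    # Crude normalization before hashing; real pipelines layer fuzzy
    # near-duplicate detection (e.g., MinHash) on top of this.
    return " ".join(text.lower().split())

def deduplicate(documents):
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the  CAT sat. ", "A different sentence."]
print(deduplicate(corpus))   # the second, near-identical string is dropped
```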
Data provenance has become a contentious topic. Many text and image corpora include copyrighted works, personal data scraped without consent, and material whose authors did not anticipate AI training. Lawsuits over training data are pending in multiple jurisdictions, and several model developers have begun licensing data from publishers in part to reduce legal exposure.
Foundation model training is compute-intensive. GPT-3 was estimated to require around 3,640 petaflop-days of compute in 2020. By 2024, frontier models were estimated to use more than 100 times that. Training the largest publicly disclosed foundation models in 2025 cost in the high tens to low hundreds of millions of dollars, mostly in GPU rental or amortized hardware costs.[2]
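For a sense of scale, the GPT-3 figure converts to raw floating-point operations as follows (a back-of-the-envelope check, treating one petaflop-day as 10^15 FLOP/s sustained for 86,400 seconds):

```python
# Back-of-the-envelope conversion of the published GPT-3 training estimate.
petaflop_days = 3640
total_flops = petaflop_days * 1e15 * 86_400   # 10^15 FLOP/s sustained for one day, times days
print(f"{total_flops:.2e} total FLOPs")       # ~3.1e+23
```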
The compute required for state-of-the-art training has roughly doubled every six months since 2010, far outpacing Moore's law. This trend has driven enormous capital investment into AI infrastructure: Nvidia GPUs, custom accelerators (TPUs at Google, Trainium at Amazon, MTIA at Meta), and the construction of dedicated data centers with multi-gigawatt power requirements.
The defining training technique for foundation models is self-supervision: the model learns by predicting parts of its input from other parts, with no need for human-provided labels. Specific objectives vary by modality:
| Modality | Common objective | Example models |
|---|---|---|
| Text | Next-token prediction (autoregressive) | GPT, LLaMA, Claude, Gemini |
| Text | Masked-token prediction | BERT, RoBERTa |
| Vision-text | Contrastive matching of image-text pairs | CLIP, ALIGN |
| Image | Denoising diffusion | Stable Diffusion, DALL-E 3, Imagen |
| Image | Masked-patch prediction | MAE, BEiT |
| Audio | Masked acoustic prediction; weak supervision | wav2vec 2.0, Whisper |
| Protein | Masked-amino-acid prediction | ESM-2 |
Self-supervision is what unlocks scale. Because no human labeling is required, training data can be collected cheaply at web scale, and the model learns general representations that transfer to many downstream tasks.
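A minimal sketch of the most common text objective, next-token prediction, assuming a generic PyTorch-style decoder-only model (hypothetical interface, not any particular system's training code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Self-supervised objective: predict token t+1 from tokens <= t.

    `model` stands in for any decoder-only network that maps
    (batch, seq) token ids to (batch, seq, vocab) logits. No human
    labels are needed: the raw text provides both inputs and targets.
    """
    inputs = token_ids[:, :-1]            # every token except the last
    targets = token_ids[:, 1:]            # the same sequence shifted by one
    logits = model(inputs)                # (batch, seq - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

The masked-token, contrastive, and diffusion objectives in the table differ in what is predicted from what, but share the same basic property: the supervision signal is constructed from the data itself.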
A pretrained foundation model is rarely deployed as-is. Almost every production use involves some form of adaptation to a specific task, domain, or behavioral profile.
| Method | What it changes | Cost | Typical use |
|---|---|---|---|
| Full fine-tuning | All model weights | High (full training) | Domain adaptation, behavior tuning |
| Instruction tuning | All weights, on instruction-response pairs | Moderate | Making a base model follow instructions |
| RLHF | Weights, via human-preference reward model | High | Aligning outputs with user intent |
| DPO | Weights, directly on preference data | Moderate | Lighter-weight alternative to RLHF |
| LoRA and adapters | Small low-rank deltas added to weights | Low | Parameter-efficient fine-tuning |
| Prefix and prompt tuning | A small set of soft tokens prepended to input | Very low | Lightweight task adaptation |
| In-context learning | Nothing (only the prompt) | None | Few-shot tasks at inference time |
| Prompt engineering | Nothing (only the prompt phrasing) | None | Steering frozen models without retraining |
| Retrieval-augmented generation | Adds a retrieval step before generation | Moderate | Grounding outputs in external knowledge |
| Tool use and function calling | Adds external API calls during generation | Moderate | Giving models access to calculators, search, code execution |
Fine-tuning, in its pure form, is increasingly rare for the largest commercial foundation models because the weights are not released. Most adaptation today is parameter-efficient (LoRA, adapters, soft prompts) or training-free (prompts, retrieval, tools). Open-weight models like LLaMA and Mistral support the full range of methods.
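A minimal sketch of the LoRA idea (an illustration of the technique, not any library's implementation): the pretrained weight matrix is frozen and a small low-rank correction is trained in its place.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank delta."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original projection plus the low-rank correction (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Only the two small matrices are updated during adaptation, so the optimizer state and the saved adapter are a small fraction of the full model's size.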
A distinctive property of large foundation models is in-context learning: the model learns a new task at inference time, from a few examples included in the prompt, with no weight updates. GPT-3 was the first model to make this property widely visible, and it has since been characterized in many follow-up papers. The mechanism is still imperfectly understood; it appears to depend on scale, training data composition, and the specific structure of the task.[5]
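A schematic few-shot prompt illustrates the idea (formatting conventions vary by model; the task and labels here are invented for illustration):

```python
# Few-shot prompt: the "training examples" live entirely in the input text.
examples = [
    ("I loved this film", "positive"),
    ("Total waste of two hours", "negative"),
    ("The soundtrack was gorgeous", "positive"),
]
query = "The plot made no sense"

prompt = "\n\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\n\nReview: {query}\nSentiment:"
print(prompt)   # sent as-is to a frozen model; no weights are updated
```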
The Stanford report devoted substantial space to risks. Four families of concerns have remained central.
Because many downstream applications are built on a small number of foundation models, those few models become critical infrastructure. Defects propagate. If GPT-4 has a particular factual error or political slant or security flaw, every product built on GPT-4 inherits that property. If a frontier model is suddenly withdrawn (regulatory action, vendor decision, security incident), every dependent application is affected. The pre-foundation-model era did not have this single-point property to anything like the same degree.[1]
Foundation models trained on web-scraped data inherit the biases of that data. Studies have documented gender, racial, religious, and political biases in outputs across language models, image generators, and multimodal systems. When a single foundation model is used in thousands of downstream applications, its biases are not just reproduced but amplified through scale.[1] Mitigation work includes data filtering, RLHF on bias-related preferences, fine-tuning on counter-stereotyped data, and inference-time filters, none of which fully solve the problem.
Training and serving foundation models consumes substantial energy. Estimates of the carbon footprint of training a single large model in 2020 ranged from tens to hundreds of tonnes of CO2 equivalent (Strubell et al. 2019, Patterson et al. 2021). Serving costs at deployment scale now dominate training costs for popular models, and global data center electricity use has risen sharply with the foundation model boom. Water use for cooling and the mining of rare earths for hardware add additional environmental impacts.[1]
Foundation models lower the cost of producing convincing text, images, audio, and video. This has implications for disinformation campaigns, fraud, non-consensual intimate imagery, and the proliferation of malware. The largest models also raise concerns about uplift in the production of biological, chemical, radiological, and nuclear weapons (CBRN), and about cyber offensive capability. Governments have introduced reporting requirements, evaluation regimes, and pre-deployment safety testing for the most capable systems in part to address these concerns.[7][11]
Evaluation of foundation models is a research field of its own. The leading benchmark suite is HELM (Holistic Evaluation of Language Models), introduced by Liang, Bommasani, and colleagues at Stanford CRFM in November 2022 (arXiv:2211.09110). HELM was designed to address the previously fragmented state of language model evaluation, where different models were tested on different benchmarks under different conditions, making fair comparison impossible.[12]
HELM evaluates models along seven dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The original release covered 30 prominent language models on 42 scenarios, raising the share of core scenarios with comparable evaluations across all models from 17.9% to 96.0%. HELM is maintained as a living benchmark with periodic releases that add new models, scenarios, and evaluation dimensions.[12]
| HELM dimension | What it measures |
|---|---|
| Accuracy | Standard task performance |
| Calibration | Whether the model's confidence matches its correctness |
| Robustness | Stability under perturbations to the input |
| Fairness | Disparities in performance across demographic groups |
| Bias | Skew in generated content along social dimensions |
| Toxicity | Production of harmful or offensive output |
| Efficiency | Compute and latency required for inference |
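To make the least self-explanatory dimension concrete, calibration is commonly summarized with expected calibration error, sketched below (a standard metric; HELM's exact protocol differs in its details):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Four toy predictions: stated confidence vs. 0/1 correctness.
print(expected_calibration_error([0.9, 0.8, 0.7, 0.6], [1, 1, 1, 0]))
```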
Other notable evaluation frameworks include MMLU (Hendrycks et al. 2020) for academic knowledge, BIG-bench (a community-contributed benchmark with more than 200 tasks), HumanEval and MBPP for code, MT-Bench and Chatbot Arena for instruction-following and dialogue, and various agent-oriented benchmarks like AgentBench and SWE-bench.
Foundation models have become a central object of AI regulation. Several jurisdictions have introduced bespoke rules for them.
The EU AI Act, adopted in 2024 and entering into force in August 2024 with phased applicability through 2027, regulates foundation models under the term "general-purpose AI model" (GPAI). A GPAI model is defined roughly as a model trained on broad data using self-supervision at scale that displays significant generality and can be integrated into a wide variety of downstream systems.[7]
All GPAI models are subject to baseline transparency obligations: technical documentation, summaries of training data, and copyright compliance procedures. A subset of GPAI models is classified as having "systemic risk," with substantially heavier obligations including model evaluations, adversarial testing, incident reporting, and cybersecurity protections. As of 2025 the systemic-risk threshold was set at training compute exceeding 10^25 floating-point operations, a level reached by GPT-4o, Gemini Ultra, Claude 3 Opus, Mistral Large 2, and a small number of other models. The systemic-risk obligations entered into application on 2 August 2025.[7]
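A rough way to gauge whether a training run crosses that threshold is the 6·N·D approximation for dense transformer training compute from the scaling-law literature (an estimating convention, not a formula in the Act; the example run below is hypothetical):

```python
def training_flops(parameters: float, tokens: float) -> float:
    """Rough dense-transformer estimate: ~6 FLOPs per parameter per training token."""
    return 6 * parameters * tokens

SYSTEMIC_RISK_THRESHOLD = 1e25   # EU AI Act cutoff for GPAI systemic risk

# Hypothetical run: a 70B-parameter model trained on 15 trillion tokens.
flops = training_flops(70e9, 15e12)
print(f"{flops:.1e} FLOPs; above threshold: {flops > SYSTEMIC_RISK_THRESHOLD}")
```

By this estimate such a run lands in the low 10^24 range, below the systemic-risk line, which is consistent with the threshold being reached only by the largest frontier training runs.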
The European Commission's AI Office is responsible for supervision of GPAI models. A General-Purpose AI Code of Practice provides a structured way for providers to demonstrate compliance until formal harmonized standards are published.
In the United States, President Biden's Executive Order 14110 of 30 October 2023 introduced the term "dual-use foundation model" as a regulatory category. The definition borrows directly from the Stanford report: a model trained on broad data, generally using self-supervision, containing at least tens of billions of parameters, applicable across a wide range of contexts, and exhibiting (or easily modified to exhibit) high levels of performance at tasks that pose serious risks to security, national economic security, or public health and safety.[11]
The order required developers of dual-use foundation models above specified compute thresholds to report training plans, safety test results, and information about cybersecurity protections to the U.S. government. It also tasked NIST with developing evaluation guidelines and the National AI Research Resource pilot. Executive Order 14110 was rescinded by President Trump on 20 January 2025 as part of a broader rollback of Biden-era AI regulation. Some of its underlying mechanisms (NIST work, voluntary commitments from developers) have continued in modified form; the formal reporting requirements have lapsed.
The United Kingdom hosted the AI Safety Summit at Bletchley Park in November 2023, which produced a declaration signed by 28 countries on the risks of frontier AI. China's Cyberspace Administration introduced rules on generative AI in 2023 that include pre-deployment safety assessments for foundation models offering services to the public. Canada, South Korea, Japan, Singapore, Brazil, and other countries have developed or are developing comparable frameworks.
As of early 2026, a small number of organizations dominate frontier foundation model development.
| Developer | Flagship foundation model (2025-2026) | Modality | Release model |
|---|---|---|---|
| OpenAI | GPT-4, GPT-5, o-series | Multimodal | API and product |
| Anthropic | Claude 4 family | Multimodal | API and product |
| Google DeepMind | Gemini 2.5 family | Multimodal | API and product |
| Meta AI | LLaMA 4 family | Multimodal | Open weights |
| Mistral AI | Mistral Large, Codestral, Pixtral | Multimodal | Mix of open and commercial |
| xAI | Grok | Multimodal | API and product |
| DeepSeek | DeepSeek V3, R1 | Text | Open weights |
| Alibaba | Qwen series | Multimodal | Open weights |
| Baidu | Ernie series | Multimodal | API and product |
| Stability AI | Stable Diffusion 3, Stable Video Diffusion | Image and video | Open weights |
| Black Forest Labs | FLUX | Image | Mix of open and commercial |
| Cohere | Command series | Text | API |
| AI21 Labs | Jamba | Text | Open weights and API |
The industry has split, by rough convention, into closed-weight providers (OpenAI, Anthropic, most of Google) and open-weight providers (Meta, Mistral, DeepSeek, many Chinese labs, several smaller Western labs). Both sides have produced state-of-the-art models. The gap between the best open-weight and best closed-weight models has narrowed considerably between 2023 and 2026.
The foundation model market has also concentrated at the infrastructure layer. Nvidia supplies the overwhelming majority of training accelerators. AWS, Microsoft Azure, and Google Cloud provide the hosting. Scale AI, Surge AI, and a handful of other firms supply the human-labeled data used for instruction tuning and RLHF. This concentration is itself a source of regulatory and competitive concern.[1]
The Foundation Model Transparency Index (FMTI), maintained by Stanford CRFM, scores major foundation model developers on 100 indicators across data, compute, model characteristics, and downstream impact. The first edition in October 2023 found that no developer scored above 54 out of 100, with average scores around 37. The 2024 update reported some improvement across 14 developers but persistent gaps in data transparency, labor practices, and downstream usage reporting.[13]
Transparency disputes have been a recurring source of friction. Developers of closed-weight models typically publish little detail about training data, training compute, training methodology, or evaluation results beyond marketing-friendly headline numbers. Open-weight model developers typically publish more, though with significant variation. The push for greater transparency has come from researchers (who need methodological detail to study the systems), regulators (who need information to enforce rules), and downstream developers (who need to understand the models they are building on).
The term "foundation model" has been criticized on several grounds.
Linguist Emily Bender and others have argued that the term is misleading because it suggests a stable, well-understood substrate when in fact these models are opaque, brittle, and behave inconsistently. Calling them "foundations" implies they can safely support load, which has not been demonstrated.[6]
Some researchers have argued that the term is essentially a rebrand of "large language model" designed to give Stanford intellectual ownership of a phenomenon that already had a name. The fact that most discourse about "foundation models" focuses on language models (rather than CLIP, AlphaFold, or Whisper) lends some weight to this critique.
From a different direction, researchers like Yann LeCun have argued that current foundation models are not foundational in any deep sense, because they lack world models, persistent memory, planning, and other capabilities he considers necessary for real intelligence. From this view, treating LLM-style foundation models as the basis for general AI is a category error.
Despite these objections, the term has become entrenched in research, industry, and policy. It is now the standard label in the European Union's AI Act (under the related term "general-purpose AI model"), in the (rescinded) U.S. Executive Order on AI, in NIST glossaries, and in most academic literature published since 2022.