Foundation models (FMs) are large machine learning models trained on broad data, generally via self-supervised learning at scale, that can then be adapted (for example via fine-tuning) to a wide range of downstream tasks across modalities.[1][2] The term emphasizes that such models serve as a common foundation upon which many downstream tasks and applications can be built, rather than bespoke systems trained separately for each task. Many prominent large language models (LLMs) and multimodal systems are commonly described as foundation models.[1]
Building foundation models is often highly resource-intensive, with the most advanced models costing hundreds of millions of dollars to cover the expenses of acquiring, curating, and processing massive datasets, as well as the compute power required for training.[3] In contrast, adapting existing foundation models for specific tasks or using them directly is far less costly, as it leverages pre-trained capabilities and typically requires only fine-tuning on smaller, task-specific datasets.[3]
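The adaptation pattern described above, reusing a pretrained model and training only a small task-specific component, can be sketched in a deliberately tiny example. Everything here is a hypothetical stand-in: the "backbone" is a fixed feature map playing the role of a frozen pretrained encoder, and the data and task are synthetic.

```python
import math
import random

random.seed(0)

def backbone(x):
    # Stand-in for a frozen pretrained encoder: a fixed nonlinear feature map.
    return [x[0] + x[1], x[0] * x[1], abs(x[0] - x[1])]

# Tiny synthetic task: label is 1.0 when the two inputs sum to more than 1.
data = []
for _ in range(200):
    x = [random.random(), random.random()]
    data.append((x, 1.0 if x[0] + x[1] > 1.0 else 0.0))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train only the linear head with plain stochastic gradient descent;
# the backbone's "weights" never change, mirroring fine-tuning a head
# on top of frozen pretrained representations.
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(50):
    for x, y in data:
        f = backbone(x)                      # frozen features, no gradient
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
        g = p - y                            # d(logistic loss)/d(logit)
        w = [wi - lr * g * fi for wi, fi in zip(w, f)]
        b -= lr * g

correct = sum(
    (sigmoid(sum(wi * fi for wi, fi in zip(w, backbone(x))) + b) > 0.5) == (y == 1.0)
    for x, y in data
)
acc = correct / len(data)
print(f"head-only accuracy: {acc:.2f}")
```

Because only the small head is optimized, the training cost is a tiny fraction of what building the backbone from scratch would require, which is the economic asymmetry the paragraph above describes.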
The phrase foundation model was popularized by researchers associated with Stanford's Center for Research on Foundation Models (CRFM) in August 2021 to describe models trained on broad data at scale using mostly self-supervision and adaptable to many tasks.[1] The researchers chose "foundation" over "foundational" because "foundational" implies that these models provide fundamental principles in a way that "foundation" does not.[3] They also noted that preexisting terms were inadequate: "'(large) language model' was too narrow given [the] focus is not only language; 'self-supervised model' was too specific to the training objective; and 'pretrained model' suggested that the noteworthy action all happened after 'pretraining.'"[1]
Foundation models differ from large language models: LLMs are a subset focused specifically on interpreting, generating, and manipulating human language, whereas foundation models encompass a broader range of modalities, such as text, images, video, and other data types.[4] All LLMs can be considered foundation models, but not all foundation models are LLMs.[5]
Subsequent policy and standards documents have adopted related terminology. In the United States, Executive Order 14110 defines a subcategory of "dual-use foundation model" as an artificial intelligence model trained on broad data, generally using self-supervision, containing at least tens of billions of parameters, applicable across many contexts, and exhibiting, or easily modifiable to exhibit, high levels of performance at tasks that pose serious risks to security, national economic security, or public health or safety.[6]
NIST documents and glossaries similarly characterize foundation models as broadly trained, self-supervised systems adaptable to varied tasks.[2][7]
In the European Union's AI Act, closely related terminology, general-purpose AI models (GPAI), is used for models intended for integration into many downstream systems, with specific obligations (including additional duties for models with systemic risk).[8]
Technologically, foundation models are built using established machine learning techniques like deep neural networks, transfer learning, and self-supervised learning.[3] The concept of pre-training a large model on a general dataset and then fine-tuning it for specific tasks has roots in earlier work on transfer learning with models like Word2vec and GloVe. However, the paradigm shifted significantly with the introduction of the Transformer architecture in 2017.[9]
Subsequent models like Google's BERT (Bidirectional Encoder Representations from Transformers) in 2018 and OpenAI's GPT series demonstrated the power of large-scale, pre-trained language models.[10] As these models grew in size and capability, their potential applications expanded far beyond their initial scope.
The 2022 releases of Stable Diffusion and ChatGPT (initially powered by the GPT-3.5 model) led to foundation models and generative AI entering widespread public discourse.[3] Further releases of LLaMA, Llama 2, and Mistral in 2023 contributed to a greater emphasis placed on how foundation models are released, with open foundation models garnering significant support and scrutiny.[3]
Foundation models are distinguished by several defining characteristics, including their large scale, training on broad data via self-supervision, and adaptability to a wide range of downstream tasks.[1]
Most modern foundation models are based on the Transformer architecture, though diffusion models are widely used for image, audio, and video generation.[16] Contrastive learning across image–text pairs (for example CLIP) is a common multimodal pretraining strategy.[17]
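The contrastive image–text strategy mentioned above can be illustrated with a minimal sketch of a CLIP-style symmetric InfoNCE objective over a batch of paired embeddings. The "embeddings" below are random stand-ins (a real model would produce them with separate image and text encoders), and the temperature value is an illustrative assumption, not a learned parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = normalize(rng.normal(size=(batch, dim)))              # image embeddings
txt = normalize(img + 0.1 * rng.normal(size=(batch, dim)))  # paired text embeddings

temperature = 0.07
logits = img @ txt.T / temperature   # pairwise cosine similarities, scaled

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy against the matching-pair index.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)           # the i-th image matches the i-th text
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print(f"symmetric contrastive loss: {loss:.3f}")
```

Minimizing this loss pulls each image embedding toward its paired caption's embedding and pushes it away from the other captions in the batch, which is what lets a single pretrained model align the two modalities.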
Pretraining typically uses self-supervised learning on broad corpora to learn general representations. For text, common objectives include next-token autoregressive modeling and masked language modeling; for images and audio, objectives include masked/denoising prediction and diffusion-based reconstruction.[1][9][16]
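The next-token objective above can be illustrated with a deliberately tiny count-based model: the label for each position is simply the token that follows it, so training requires no manual annotation. The corpus and greedy decoding rule here are illustrative simplifications, not how a real foundation model works.

```python
from collections import Counter, defaultdict

# "Pretraining": count which token follows which in a tiny corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    # Greedy prediction: the most frequent successor seen during "pretraining".
    return bigrams[token].most_common(1)[0][0]

print(predict_next("the"))  # "the" was followed by cat (2x), mat, fish -> "cat"
```

Modern foundation models apply the same self-supervised recipe, predicting held-out or upcoming content, but with neural networks over vastly larger corpora.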
| Model | Modality | Parameters | Year | Developer | Key Features | Source |
|---|---|---|---|---|---|---|
| BERT | Text (LLM) | 340M | 2018 | Google | First bidirectional foundation model | [10] |
| GPT-3 | Text (LLM) | 175B | 2020 | OpenAI | Demonstrated strong few-shot learning | [18] |
| GPT-4 | Multimodal | >1T (estimated) | 2023 | OpenAI | Advanced multimodal capabilities | [20] |
| Claude series | Text (LLM) | Various | 2023-2024 | Anthropic | Advanced reasoning and safety features | [21] |
| Gemini | Multimodal | Various | 2023-2024 | Google DeepMind | State-of-the-art multimodal model | [22] |
| Llama 2 | Text (LLM) | 7B-70B | 2023 | Meta AI | Open-weight foundation model | [23] |
| BLOOM | Text (LLM) | 176B | 2022 | BigScience | Multilingual, supports 46 languages | [24] |
| CLIP | Vision-Text | 400M | 2021 | OpenAI | Contrastive text-image pretraining | [17] |
| DALL-E 3 | Text-to-Image | Unknown | 2023 | OpenAI | High-quality text-to-image generation | [25] |
| Stable Diffusion | Text-to-Image | 890M | 2022 | Stability AI | Open-source image generation | [26] |
| Flamingo | Multimodal | 80B | 2022 | DeepMind | Visual language model for few-shot learning | [27] |
| AlphaFold 2 | Protein structure | 21M | 2021 | DeepMind | Protein structure prediction | [28] |
Foundation models enable a broad range of applications across language, vision, and other modalities, often after only modest adaptation such as fine-tuning or prompting.
Policy makers and standards bodies have proposed governance approaches tailored to foundation models, including the United States' Executive Order 14110 and the European Union's AI Act.[6][8]
Certain highly advanced foundation models are termed "frontier models," which have the potential to "possess dangerous capabilities sufficient to pose severe risks to public safety."[3]
Research initiatives evaluate the transparency of foundation model developers across dimensions such as data, compute, model characteristics, and downstream impact.
The supply chain for foundation models involves upstream resources (data from providers such as Scale AI and Surge AI; compute from AWS, Google Cloud, and Microsoft Azure) and downstream adaptations.[3] Costs are concentrated in compute (roughly 80% of AI capital expenditure in 2023), contributing to market consolidation among a small number of companies.[3]
Release strategies for foundation models range from fully closed, to limited or API-gated access, to open release of model weights, as with Llama 2.[3]