Foundation models
Foundation models (FMs) are large machine learning models trained on broad data, generally via self-supervised learning at scale, that can then be adapted (for example via fine-tuning) to a wide range of downstream tasks across modalities.[1][2] The term emphasizes that such models serve as a common foundation upon which many downstream tasks and applications can be built, rather than bespoke systems trained separately for each task. Many prominent large language models (LLMs) and multimodal systems are commonly described as foundation models.[1]
Building foundation models is often highly resource-intensive, with the most advanced models costing hundreds of millions of dollars to cover the expenses of acquiring, curating, and processing massive datasets, as well as the compute power required for training.[3] In contrast, adapting existing foundation models for specific tasks or using them directly is far less costly, as it leverages pre-trained capabilities and typically requires only fine-tuning on smaller, task-specific datasets.[3]
Definition and scope
The phrase foundation model was popularized by researchers associated with Stanford's Center for Research on Foundation Models (CRFM) in August 2021 to describe models trained on broad data at scale using mostly self-supervision and adaptable to many tasks.[1] The researchers chose "foundation" over "foundational" because "foundational" implies that these models provide fundamental principles in a way that "foundation" does not.[3] They also noted that preexisting terms were inadequate: "'(large) language model' was too narrow given [the] focus is not only language; 'self-supervised model' was too specific to the training objective; and 'pretrained model' suggested that the noteworthy action all happened after 'pretraining.'"[1]
Large language models (LLMs) are a subset of foundation models focused specifically on interpreting, generating, and manipulating human language, whereas foundation models as a class span broader modalities such as text, images, video, and other data types.[4] All LLMs can be considered foundation models, but not all foundation models are LLMs.[5]
Legal definitions
Subsequent policy and standards documents have adopted related terminology. In the United States, Executive Order 14110 defines a subcategory, the "dual-use foundation model," as an artificial intelligence model that is trained on broad data, generally uses self-supervision, contains at least tens of billions of parameters, is applicable across many contexts, and exhibits, or could easily be modified to exhibit, high levels of performance at tasks that pose serious risks to security, national economic security, or public health or safety.[6]
NIST documents and glossaries similarly characterize foundation models as broadly trained, self-supervised systems adaptable to varied tasks.[2][7]
The European Union's AI Act uses the closely related term general-purpose AI (GPAI) model for models intended for integration into many downstream systems, and attaches specific obligations to such models, including additional duties for models with systemic risk.[8]
History
Technologically, foundation models are built using established machine learning techniques like deep neural networks, transfer learning, and self-supervised learning.[3] The concept of pre-training a large model on a general dataset and then fine-tuning it for specific tasks has roots in earlier work on transfer learning with models like Word2vec and GloVe. However, the paradigm shifted significantly with the introduction of the Transformer architecture in 2017.[9]
Subsequent models like Google's BERT (Bidirectional Encoder Representations from Transformers) in 2018 and OpenAI's GPT series demonstrated the power of large-scale, pre-trained language models.[10] As these models grew in size and capability, their potential applications expanded far beyond their initial scope.
The 2022 releases of Stable Diffusion and ChatGPT (initially powered by the GPT-3.5 model) brought foundation models and generative AI into widespread public discourse.[3] The 2023 releases of LLaMA, Llama 2, and Mistral drew greater attention to how foundation models are released, with open foundation models attracting both significant support and scrutiny.[3]
Characteristics
Foundation models are distinguished by several defining characteristics:
- Broad pretraining data: FMs are trained on diverse, large-scale corpora (text, images, audio, code, etc.), typically using self-supervised learning objectives such as next-token prediction or masked modeling.[1][11]
- Adaptability: After pretraining, they can be adapted via fine-tuning, instruction tuning, or reinforcement learning from human feedback (RLHF) to perform specific tasks or follow instructions more reliably.[12][13]
- Scale: The largest foundation models operate at massive scale along three dimensions: data (billions to trillions of tokens), model size (tens of billions to trillions of parameters), and compute (thousands of GPUs for training).[14]
- Emergence: Quantitative increases in scale lead to new qualitative capabilities that were not present in smaller versions, such as zero-shot learning and few-shot learning.[15]
- Homogenization: A wide array of applications are built on a small number of foundation models, creating both efficiency gains and systemic risks.[1]
Architecture and Training
Most modern foundation models are based on the Transformer architecture, though diffusion models are widely used for image, audio, and video generation.[16] Contrastive learning across image–text pairs (for example CLIP) is a common multimodal pretraining strategy.[17]
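As a rough illustration of the contrastive strategy described above, the sketch below computes a symmetric image-text contrastive loss in plain Python. The function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are assumptions made for illustration; real systems produce embeddings with neural encoders and may treat the temperature as a learned parameter.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: the k-th image should match the k-th text and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch (hypothetical embeddings): matched pairs vs. deliberately mismatched pairs
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image embedding is closest to its own caption and high when the pairing is scrambled, which is the signal that pulls matched pairs together during pretraining.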
Pretraining
Pretraining typically uses self-supervised learning on broad corpora to learn general representations. For text, common objectives include next-token autoregressive modeling and masked language modeling; for images and audio, objectives include masked/denoising prediction and diffusion-based reconstruction.[1][9][16]
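The two text objectives named above can be sketched in a few lines of Python. This is a minimal illustration, not a training loop: the helper names, the whitespace "tokenizer", and the uniform stand-in predictor are all assumptions, and a real model would replace the predictor with a neural network trained to lower the loss.

```python
import math

def next_token_pairs(tokens):
    """Autoregressive objective: every prefix is used to predict the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, positions, mask="[MASK]"):
    """Masked objective: hide the chosen positions and keep the originals as targets."""
    masked = [mask if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets

def average_nll(pairs, predict):
    """Mean negative log-likelihood, where predict(context) returns {token: probability}."""
    return -sum(math.log(predict(ctx)[tgt]) for ctx, tgt in pairs) / len(pairs)

tokens = "the cat sat down".split()
pairs = next_token_pairs(tokens)

# Stand-in "model" that assigns equal probability to every vocabulary item
vocab = sorted(set(tokens))
def uniform(context):
    return {t: 1 / len(vocab) for t in vocab}

loss = average_nll(pairs, uniform)          # equals log(len(vocab)) for a uniform guesser
masked, targets = mask_tokens(tokens, {1})  # hide the token at index 1
print(pairs[0], round(loss, 3), masked, targets)
```

In both cases the labels come from the data itself, which is what makes the objectives self-supervised: no human annotation is needed to build the training pairs.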
Adaptation methods
- Fine-tuning: Updating model weights on task- or domain-specific data.[1]
- Instruction tuning: Supervised fine-tuning on collections of natural-language instructions to improve zero-shot task following (for example FLAN).[12]
- Reinforcement learning from human feedback (RLHF): Optimizing the model against a reward model learned from human preferences, in order to align outputs with user intent (for example InstructGPT).[13]
- Prompt engineering: Crafting inputs to elicit desired outputs without changing model weights, including zero-shot, one-shot, and few-shot prompting.[18]
- Tool use / retrieval-augmented generation: Integrating external knowledge bases or tools to access up-to-date or specialized information.[19]
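The two adaptation methods that leave model weights untouched, prompting and retrieval augmentation, amount to building the input text carefully. The sketch below shows one possible way to do that. The prompt layout, the function names, and the word-overlap retriever are illustrative assumptions; production systems typically use embedding-based retrieval, and the resulting string would be sent to whatever model is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Zero-shot when examples is empty, one-shot with one example, few-shot with several."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a stand-in for a real retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(query, documents, k=1):
    """Prepend the top-k retrieved passages so the model can ground its answer in them."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

instruction = "Classify the sentiment as positive or negative."
examples = [("I loved it", "positive"), ("Terrible service", "negative")]
zero_shot = build_few_shot_prompt(instruction, [], "The food was great")
few_shot = build_few_shot_prompt(instruction, examples, "The food was great")

documents = ["BLOOM is a multilingual model", "CLIP pairs images with text"]
rag_prompt = build_rag_prompt("which model is multilingual", documents)
print(few_shot)
print(rag_prompt)
```

Because only the prompt changes, the same pretrained model can be redirected to a new task or supplied with newer information without any retraining.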
Notable Examples
| Model | Modality | Parameters | Year | Developer | Key Features | Source |
|---|---|---|---|---|---|---|
| BERT | Text (LLM) | 340M | 2018 | Google | First bidirectional foundation model | [10] |
| GPT-3 | Text (LLM) | 175B | 2020 | OpenAI | Demonstrated strong few-shot learning | [18] |
| GPT-4 | Multimodal | Undisclosed (>1T by unofficial estimates) | 2023 | OpenAI | Advanced multimodal capabilities | [20] |
| Claude series | Text (LLM) | Various | 2023-2024 | Anthropic | Advanced reasoning and safety features | [21] |
| Gemini | Multimodal | Various | 2023-2024 | Google | State-of-the-art multimodal model | [22] |
| Llama 2 | Text (LLM) | 7B-70B | 2023 | Meta AI | Open-weight foundation model | [23] |
| BLOOM | Text (LLM) | 176B | 2022 | BigScience | Multilingual, supports 46 languages | [24] |
| CLIP | Vision-Text | 400M | 2021 | OpenAI | Contrastive text-image pretraining | [17] |
| DALL-E 3 | Text-to-Image | Unknown | 2023 | OpenAI | High-quality text-to-image generation | [25] |
| Stable Diffusion | Text-to-Image | 890M | 2022 | Stability AI | Open-source image generation | [26] |
| Flamingo | Multimodal | 80B | 2022 | DeepMind | Visual language model for few-shot learning | [27] |
| AlphaFold 2 | Protein structure | 21M | 2021 | DeepMind | Protein structure prediction | [28] |
Applications
Foundation models enable a broad range of applications, often after modest adaptation:
Natural Language Processing
- Question answering, summarization, translation, code generation, and dialogue assistants based on large language models.[18]
- Sentiment analysis and text classification
- Content generation for articles, marketing copy, emails, and creative writing
Computer Vision
- Zero-shot classification and open-vocabulary recognition via contrastive pretraining[17]
- Text-to-image generation via diffusion[16]
- Object detection, image segmentation, and optical character recognition
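Zero-shot classification with a contrastively pretrained model can be pictured as a nearest-neighbor lookup in a shared embedding space: each candidate label is written as a short text prompt, embedded, and compared with the image embedding. The sketch below assumes the embeddings have already been computed; the three-dimensional vectors and label prompts are made-up values for illustration, not outputs of any real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, label_embs):
    """Return the label prompt whose embedding is most similar to the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical text embeddings for three label prompts
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]  # hypothetical image embedding

label, scores = zero_shot_classify(image_emb, label_embs)
print(label)
```

New classes can be added by writing new label prompts, with no retraining, which is why this style of recognition is described as open-vocabulary.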
Scientific Research
- Drug discovery and molecular design[28]
- Climate modeling and prediction
- Materials science and chemistry applications
- Genomics and radiology applications[29]
Code and Software Development
- Code completion and generation (GitHub Copilot, Amazon CodeWhisperer)
- Bug detection and fixing
- Code explanation and documentation
- Language translation between programming languages
Business and Industry
- Customer service chatbots and virtual assistants
- Supply chain optimization and predictive maintenance
- Fraud detection and risk assessment
- Automated investment research and financial analysis[30]
Robotics
- Visual navigation and task planning
- Human-robot interaction
- Manipulation tasks
- Environment simulation using world models[31]
Governance, Safety, and Regulation
Policy makers and standards bodies have proposed governance approaches tailored to foundation models:
- United States: The 2023 Executive Order introduces reporting and safety requirements tied to high-risk "dual-use foundation models," defining them by training scale and potential risk profile.[6][32]
- NIST: Guidance on managing misuse risk for dual-use FMs provides terminology and recommended practices for developers and deployers.[7]
- European Union: The AI Act establishes obligations for general-purpose AI models (GPAI), including heightened duties for models with systemic risk and a Code of Practice pathway.[8]
Frontier Models
Certain highly advanced foundation models are termed "frontier models," which have the potential to "possess dangerous capabilities sufficient to pose severe risks to public safety."[3] These capabilities may include:
- Designing and synthesizing new biological or chemical weapons
- Producing and propagating convincing, tailored disinformation
- Harnessing unprecedented offensive cyber capabilities
- Evading human control through deceptive means
Transparency and Documentation
Research initiatives evaluate the transparency of FM developers across data, compute, model characteristics, and downstream impact:
- Foundation Model Transparency Index (FMTI): An index specifying 100 indicators to measure transparency, with reports in 2023 (v1.0) and 2024 (v1.1). The 2023 paper documented low average transparency; the 2024 update reported improvements across 14 developers while noting persistent gaps.[33][34]
Challenges and Limitations
Computational Requirements
- Training costs can reach hundreds of millions of dollars[3]
- Require thousands of GPUs for training
- Environmental impact from energy consumption
- By one widely cited estimate, the compute used for the largest training runs has doubled roughly every 3.4 months since 2012[14]
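To put the 3.4-month doubling figure in perspective, the short calculation below converts a doubling period into a yearly growth factor. The function name is an assumption, and the 24-month doubling period is a hypothetical comparison point chosen for contrast, not a figure from the cited source.

```python
def growth_factor(months, doubling_months=3.4):
    """Multiplicative growth after `months` if a quantity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)

doublings_per_year = 12 / 3.4                     # about 3.5 doublings in a year
per_year = growth_factor(12)                      # between 11x and 12x per year
slower = growth_factor(12, doubling_months=24)    # about 1.4x per year, for comparison

print(round(doublings_per_year, 2), round(per_year, 1), round(slower, 2))
```

A 3.4-month doubling period therefore implies compute demand growing by roughly an order of magnitude each year, far faster than the comparison rate.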
Bias and Fairness
- Models can perpetuate and amplify biases present in training data[1]
- Social and demographic biases
- Geographic and cultural biases
- Language biases favoring well-represented languages
- Risk of homogenization spreading biases across all downstream applications
Data Quality and Privacy
- Web-scraped data often contains toxic, biased, or copyrighted material[3]
- Privacy concerns when training on user data
- Challenges in data curation at scale
- Risk of memorizing and reproducing training data
Evaluation Challenges
- Difficulty in comprehensively evaluating general-purpose models
- Emergent capabilities that are hard to predict
- Need for new benchmarks and evaluation frameworks
Supply Chain
The supply chain for foundation models involves upstream resources, namely data (from providers such as Scale AI and Surge AI) and compute (from AWS, Google Cloud, and Microsoft Azure), as well as downstream adaptations.[3] Costs are concentrated in compute (reportedly 80% of AI capital in 2023), which contributes to market consolidation among a few companies.[3]
Release Strategies
Release strategies for foundation models include:
- API access: Users query the model through an interface (for example OpenAI's GPT-4)
- Open weights: Model weights available for download (for example Meta's Llama)
- Closed/Limited access: Restricted to specific users or organizations
See also
- Large language model
- Transformer (machine learning model)
- Diffusion model
- Contrastive learning
- Self-supervised learning
- Transfer learning
- Fine-tuning
- Instruction tuning
- Reinforcement learning from human feedback
- Retrieval-augmented generation
- General-purpose AI
- Generative artificial intelligence
- Multimodal learning
- Few-shot learning
References
- [1] Stanford Center for Research on Foundation Models (CRFM), On the Opportunities and Risks of Foundation Models (2021). https://crfm.stanford.edu/report.html ; arXiv: https://arxiv.org/abs/2108.07258
- [2] NIST CSRC Glossary, foundation model. https://csrc.nist.gov/glossary/term/foundation_model
- [3] Foundation model - Wikipedia. https://en.wikipedia.org/wiki/Foundation_model
- [4] Foundation Model vs LLM: Key Differences Explained - Openxcell. https://www.openxcell.com/blog/foundation-model-vs-llm/
- [5] Foundation Models Vs LLM(Large Language Models) - LinkedIn. https://www.linkedin.com/pulse/foundation-models-vs-llmlarge-language-aman-walia-aslsc
- [6] Executive Order 14110 (Oct. 30, 2023), Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Federal Register: https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence ; White House archive: https://bidenwhitehouse.archives.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
- [7] NIST, Managing Misuse Risk for Dual-Use Foundation Models (AI 800-1 initial/draft publications, 2024–2025). https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-1.ipd.pdf ; https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-1.ipd2.pdf
- [8] European Commission, General-Purpose AI Models in the AI Act – Questions & Answers (2025). https://digital-strategy.ec.europa.eu/en/faqs/general-purpose-ai-models-ai-act-questions-answers
- [9] Vaswani et al., Attention Is All You Need (2017). arXiv: https://arxiv.org/abs/1706.03762 ; NeurIPS PDF: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- [10] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). https://arxiv.org/abs/1810.04805
- [11] IBM, What is self-supervised learning?. https://www.ibm.com/think/topics/self-supervised-learning
- [12] Jason Wei et al., Finetuned Language Models Are Zero-Shot Learners (arXiv 2021; ICLR 2022). https://arxiv.org/abs/2109.01652 ; https://openreview.net/pdf?id=gEZrGCozdqR
- [13] Long Ouyang et al., Training language models to follow instructions with human feedback (NeurIPS 2022). https://arxiv.org/abs/2203.02155 ; https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
- [14] What are Foundation Models? - Foundation Models in Generative AI Explained - AWS. https://aws.amazon.com/what-is/foundation-models/
- [15] Wei et al., Emergent Abilities of Large Language Models (2022). https://arxiv.org/abs/2206.07682
- [16] Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (2020). arXiv: https://arxiv.org/abs/2006.11239 ; NeurIPS PDF: https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
- [17] Radford et al., Learning Transferable Visual Models From Natural Language Supervision (2021). arXiv: https://arxiv.org/abs/2103.00020 ; ICML PDF: https://proceedings.mlr.press/v139/radford21a/radford21a.pdf
- [18] Tom B. Brown et al., Language Models are Few-Shot Learners (GPT-3; 2020). arXiv: https://arxiv.org/abs/2005.14165 ; NeurIPS PDF: https://papers.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- [19] Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020). arXiv: https://arxiv.org/abs/2005.11401 ; NeurIPS PDF: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf
- [20] OpenAI, GPT-4 Technical Report (2023). https://arxiv.org/abs/2303.08774
- [21] Anthropic, Claude 3 Model Card (2024). https://www.anthropic.com/claude
- [22] Google, Gemini: A Family of Highly Capable Multimodal Models (2023). https://arxiv.org/abs/2312.11805
- [23] Hugo Touvron et al., Llama 2: Open Foundation and Fine-Tuned Chat Models (2023). arXiv: https://arxiv.org/abs/2307.09288 ; Meta page: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
- [24] BigScience Workshop, BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022). https://arxiv.org/abs/2211.05100
- [25] OpenAI, DALL-E 3 System Card (2023). https://cdn.openai.com/papers/DALL_E_3_System_Card.pdf
- [26] Rombach et al., High-Resolution Image Synthesis with Latent Diffusion Models (2022). https://arxiv.org/abs/2112.10752
- [27] Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning (2022). https://arxiv.org/abs/2204.14198
- [28] Jumper et al., Highly accurate protein structure prediction with AlphaFold (Nature, 2021). https://doi.org/10.1038/s41586-021-03819-2
- [29] Healthcare applications in Stanford FM report. https://crfm.stanford.edu/report.html#healthcare
- [30] Foundation Models And LLMs: 19 Real-World, Practical Use Cases - Forbes (2025). https://www.forbes.com/councils/forbestechcouncil/2025/02/05/foundation-models-and-llms-19-real-world-practical-use-cases/
- [31] NVIDIA Cosmos World Foundation Models (2025). https://blogs.nvidia.com/blog/world-foundation-models/
- [32] U.S. Congressional Research Service, The AI Executive Order and Its Potential Implications for DOD (Dec. 12, 2023). https://www.congress.gov/crs-product/IN12286
- [33] Rishi Bommasani et al., The Foundation Model Transparency Index (Oct. 2023). https://arxiv.org/abs/2310.12941 ; Stanford HAI explainer: https://hai.stanford.edu/news/introducing-foundation-model-transparency-index
- [34] Rishi Bommasani et al., The Foundation Model Transparency Index v1.1: May 2024 (paper & CRFM update). https://crfm.stanford.edu/fmti/paper.pdf ; https://crfm.stanford.edu/2024/05/21/fmti-may-2024.html