A pre-trained model is a machine learning model that has been previously trained on a large dataset, typically for a general task, and can then be adapted to new, related tasks. Rather than building a model from scratch and initializing its weights randomly, practitioners start with a pre-trained model that has already learned useful representations from large-scale data. This approach is central to transfer learning, where knowledge gained from one task is applied to a different but related task. Pre-trained models have become the dominant paradigm in modern deep learning, powering applications across computer vision, natural language processing, speech recognition, and multimodal AI.
The practice of pre-training has transformed AI research and industry. Instead of requiring enormous labeled datasets and weeks of compute time for every new application, teams can download a pre-trained model and fine-tune it on a smaller, task-specific dataset. This reduces cost, training time, and data requirements while often improving performance compared to training from scratch.
Imagine you want to teach someone to cook Italian food. You could start by teaching them what a stove is, how to hold a knife, and what salt tastes like. Or you could find someone who already knows how to cook French food and just teach them the differences for Italian recipes. A pre-trained model is like that experienced cook: it already knows lots of general things (how ingredients combine, how heat works), so it can learn a new style of cooking much faster than starting from zero.
The concept of transferring learned knowledge between tasks dates back to Stevo Bozinovski's 1976 work on neural network training transfer. However, the modern notion of transfer learning took shape with the work of Thrun and Pratt in 1998 and was later formalized in a survey by Pan and Yang in 2009. These works established that a model trained on one domain can improve generalization on a related domain, particularly when labeled data in the target domain is scarce.
In computer vision, the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) marked a turning point. AlexNet, designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points. This result demonstrated that deep convolutional neural networks (CNNs) trained on large image datasets could learn highly transferable visual features. AlexNet's success triggered widespread adoption of ImageNet-pretrained CNNs as starting points for downstream vision tasks.
Following AlexNet, a series of deeper and more sophisticated CNN architectures were developed and pre-trained on ImageNet:
| Model | Year | Key innovation | Parameters | ILSVRC top-5 error |
|---|---|---|---|---|
| AlexNet | 2012 | Deep CNN with ReLU, dropout, GPU training | 60 million | 15.3% |
| VGGNet | 2014 | Uniform 3x3 convolutions, increased depth | 138 million (VGG-16) | 7.3% |
| GoogLeNet (Inception) | 2014 | Inception modules with parallel filter sizes | 6.8 million | 6.7% |
| ResNet | 2015 | Residual (skip) connections enabling 152+ layers | 60 million (ResNet-152) | 3.6% |
| DenseNet | 2017 | Dense connections between all layers | 20 million (DenseNet-201) | ~3.5% |
Research by Yosinski et al. (2014), published at NeurIPS, experimentally quantified the transferability of features at different layers of a deep neural network. The study found that early layers learn general features (such as edge detectors and color blobs) applicable to many tasks, while later layers become increasingly task-specific. Initializing a network with transferred features from almost any number of layers produced a boost to generalization, even after fine-tuning to a new dataset. This finding provided the empirical justification for using ImageNet-pretrained models as feature extractors or initialization points across nearly all computer vision tasks, including object detection, image segmentation, and pose estimation.
In natural language processing, pre-training began with word embeddings. Word2Vec, introduced by Tomas Mikolov et al. at Google in 2013, trained shallow neural networks on large text corpora to produce dense vector representations of words. GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, used word co-occurrence statistics to learn similar embeddings. Both Word2Vec and GloVe produced static, context-independent embeddings, meaning a word like "bank" would have the same vector regardless of whether it referred to a financial institution or a river bank.
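As a brief illustration of what "static" means here, the following sketch trains a tiny Word2Vec model with the gensim library; the library choice, toy corpus, and hyperparameters are illustrative rather than those of the original systems. The word "bank" receives a single vector shared across both contexts.

```python
# Minimal sketch: training static word embeddings with gensim's Word2Vec.
# The toy corpus and hyperparameters are illustrative placeholders.
from gensim.models import Word2Vec

sentences = [
    ["she", "deposited", "money", "at", "the", "bank"],
    ["they", "sat", "on", "the", "river", "bank"],
]

# Train a tiny skip-gram model; real models are trained on billions of tokens.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# A static embedding assigns "bank" one vector, regardless of its context.
print(model.wv["bank"].shape)  # (50,)
```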
ELMo (Embeddings from Language Models), released by the Allen Institute for AI in February 2018, addressed this limitation. ELMo used a deep bidirectional LSTM pre-trained on a corpus of one billion words to produce context-dependent word representations. Unlike Word2Vec and GloVe, ELMo generated different embeddings for the same word depending on its surrounding context. ELMo's pre-trained representations improved performance on a wide range of NLP benchmarks and marked the transition from pre-trained word vectors to pre-trained full models in NLP.
The release of BERT (Bidirectional Encoder Representations from Transformers) by Google in October 2018 and GPT (Generative Pre-trained Transformer) by OpenAI in June 2018 established the modern paradigm of large-scale pre-training on unlabeled text using transformer architectures.
BERT used two pre-training objectives: masked language modeling (MLM), where 15% of input tokens are randomly masked and the model predicts them from bidirectional context, and next sentence prediction (NSP), where the model determines whether two sentences are consecutive. BERT was pre-trained on English Wikipedia (2.5 billion words) and the BookCorpus dataset (800 million words). The BERT-Base model contained 110 million parameters, while BERT-Large contained 340 million parameters.
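A minimal sketch of the MLM objective at inference time, using the Hugging Face Transformers `fill-mask` pipeline with the public `bert-base-uncased` checkpoint (an illustrative choice); the model predicts plausible tokens for the masked position from bidirectional context.

```python
# Minimal sketch: querying BERT's masked-language-modeling head via the
# Transformers pipeline. Model name and prompt are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The goal of pre-training is to learn general [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```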
The GPT series took a different approach, using autoregressive (left-to-right) language modeling as its pre-training objective. The progression from GPT-1 to GPT-3 demonstrated the power of scaling:
| Model | Year | Parameters | Pre-training data | Pre-training objective |
|---|---|---|---|---|
| GPT-1 | 2018 | 117 million | BookCorpus | Autoregressive LM |
| GPT-2 | 2019 | 1.5 billion | WebText (40 GB) | Autoregressive LM |
| GPT-3 | 2020 | 175 billion | Common Crawl + others (570 GB) | Autoregressive LM |
| GPT-4 | 2023 | Undisclosed (rumored ~1.7 trillion) | Undisclosed | Autoregressive LM (RLHF applied after pre-training) |
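A minimal sketch of the autoregressive objective in use: the small public `gpt2` checkpoint (an illustrative choice, since GPT-3 and GPT-4 are available only through APIs) continues a prompt one token at a time via the Transformers library.

```python
# Minimal sketch: autoregressive generation with a small pre-trained GPT-2
# checkpoint. Each new token is predicted from all of the tokens before it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A pre-trained model is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```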
This era also produced a wide range of other pre-trained transformer models, including RoBERTa, ALBERT, XLNet, ELECTRA, DeBERTa, and T5.
In supervised pre-training, a model is trained on a large labeled dataset before being adapted to a target task. The classic example is training a CNN on the ImageNet classification task (1,000 object categories across 1.2 million images) and then transferring the learned features to a new task. Supervised pre-training works well when a large, high-quality labeled dataset is available for the source task. However, creating such labeled datasets is expensive, and supervised pre-training can introduce biases present in the source labels.
Unsupervised pre-training trains models to learn representations from unlabeled data by reconstructing or predicting parts of the input. Traditional approaches include autoencoders, which compress input data into a lower-dimensional representation and then reconstruct it, and clustering algorithms like k-means. Unsupervised pre-training was historically used as an initialization strategy for deep networks before techniques like batch normalization and better activation functions made training deep networks from scratch more feasible.
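A minimal PyTorch sketch of an autoencoder (the dimensions and random data are illustrative placeholders) shows the reconstruction objective described above:

```python
# Minimal sketch of an autoencoder: compress the input to a low-dimensional
# code, then reconstruct it, minimizing reconstruction error.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # a batch of flattened inputs
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
loss.backward()
```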
Self-supervised pre-training generates supervisory signals from the data itself, without requiring human-provided labels. This approach has become the dominant pre-training paradigm in both NLP and computer vision.
In NLP, self-supervised objectives include:
- Masked language modeling: randomly mask tokens and predict them from the surrounding context (BERT, RoBERTa).
- Autoregressive language modeling: predict the next token from all preceding tokens (the GPT series).
- Denoising and span corruption: corrupt spans of text and train the model to reconstruct them (T5, BART).
- Sentence-level objectives such as next sentence prediction (BERT) and sentence order prediction (ALBERT).
In computer vision, self-supervised methods include:
- Contrastive learning: pull augmented views of the same image together in embedding space while pushing other images apart (SimCLR, MoCo).
- Masked image modeling: mask image patches and train the model to reconstruct them (MAE, BEiT).
- Pretext tasks such as predicting image rotations, solving jigsaw puzzles, or colorizing grayscale images.
- Self-distillation methods that match representations across views without negative pairs (BYOL, DINO).
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in February 2021, represents a multimodal pre-training approach. CLIP jointly trains an image encoder (either a ResNet or a Vision Transformer) and a text encoder (a 12-layer transformer) on 400 million image-text pairs collected from the internet. The model learns to match images with their corresponding text descriptions using a contrastive objective. CLIP's pre-trained representations enable zero-shot classification on new image categories without any task-specific fine-tuning, simply by comparing image embeddings with text embeddings of category descriptions.
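A minimal sketch of CLIP-style zero-shot classification using the public `openai/clip-vit-base-patch32` checkpoint via the Transformers library; the image path and label set are illustrative placeholders.

```python
# Minimal sketch: zero-shot classification by comparing an image embedding
# with text embeddings of candidate label descriptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image file
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image   # image-text similarity scores
print(logits.softmax(dim=-1))               # probabilities over the label set
```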
Once a model has been pre-trained, it must be adapted to a specific downstream task. Several strategies exist, ranging from simple to sophisticated.
The simplest adaptation approach treats the pre-trained model as a fixed feature extractor. The pre-trained model's weights are frozen, and only a new output layer (or a small classifier head) is trained on the target dataset. This approach works well when the target dataset is small and the pre-training domain is similar to the target domain. In computer vision, this often means extracting features from one of the later layers of a pre-trained CNN and training a linear classifier or small neural network on top.
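A minimal sketch of this approach with a torchvision ResNet-50 pretrained on ImageNet; the target class count and optimizer settings are illustrative.

```python
# Minimal sketch: frozen pre-trained backbone + new trainable classification head.
import torch
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False              # freeze all pre-trained weights

num_target_classes = 10                      # illustrative target task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new head (trainable)

# Only the head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```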
Full fine-tuning updates all parameters of the pre-trained model on the target task's data. The model is typically initialized with the pre-trained weights and trained with a lower learning rate than would be used for training from scratch. This is the most expressive adaptation method, allowing the model to adjust all of its representations to the target task. However, full fine-tuning is computationally expensive for very large models and can lead to overfitting when the target dataset is small.
A middle ground between feature extraction and full fine-tuning, gradual unfreezing starts by training only the new output layer while keeping all pre-trained layers frozen. Layers are then unfrozen progressively from the output side toward the input, so the earlier layers, which encode the most general features, are adjusted only after the later, more task-specific layers have adapted. This technique, popularized by Howard and Ruder (2018) in ULMFiT, helps prevent catastrophic forgetting of the pre-trained representations.
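A minimal sketch of gradual unfreezing on a torchvision ResNet-50; the layer grouping and unfreezing schedule are illustrative rather than ULMFiT's exact recipe.

```python
# Minimal sketch: unfreeze layer groups progressively, from the output toward the input.
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)   # new head for an illustrative 10-class task

# Groups ordered from the output layer back toward the input.
layer_groups = [model.fc, model.layer4, model.layer3, model.layer2, model.layer1]

def unfreeze_through(stage: int) -> None:
    """Make every group up to and including `stage` trainable."""
    for group in layer_groups[: stage + 1]:
        for param in group.parameters():
            param.requires_grad = True

unfreeze_through(0)   # first: train only the new head
# ...train for a few epochs, then:
unfreeze_through(1)   # next: also adapt the last residual block, and so on
```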
As pre-trained models have grown to billions of parameters, full fine-tuning has become impractical for many users. Parameter-efficient fine-tuning methods update only a small subset of model parameters while keeping the rest frozen.
| Method | Year | Approach | Trainable parameters |
|---|---|---|---|
| Adapter tuning | 2019 | Insert small bottleneck modules between transformer layers | ~3-4% of the original model's parameters |
| Prefix tuning | 2021 | Prepend trainable vectors to each transformer layer's input | ~0.1% of the original model's parameters |
| LoRA (Low-Rank Adaptation) | 2021 | Decompose weight updates into low-rank matrices | Up to 10,000x fewer than full fine-tuning |
| QLoRA | 2023 | Combine LoRA with 4-bit quantization of the frozen base model | Same as LoRA, with further memory savings |
| DoRA | 2024 | Decompose weights into magnitude and direction components on top of LoRA | Similar to LoRA, with improved accuracy |
LoRA, proposed by Hu et al. in 2021, has become the most widely used PEFT method. Instead of updating the full weight matrix of a pre-trained model, LoRA freezes the original weights and injects trainable low-rank decomposition matrices into each layer. For GPT-3 (175 billion parameters), LoRA reduced the number of trainable parameters by 10,000 times and GPU memory requirements by three times compared to full fine-tuning with Adam. In some cases, LoRA has even outperformed full fine-tuning because it avoids catastrophic forgetting of pre-trained knowledge.
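A minimal sketch of LoRA fine-tuning with the `peft` library, applied to the small public `gpt2` checkpoint; the rank, scaling factor, and target modules shown are illustrative defaults rather than the settings from the original paper.

```python
# Minimal sketch: wrap a frozen pre-trained causal LM with trainable LoRA adapters.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2 attention projection to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the injected LoRA matrices are trainable
```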
For very large language models, adaptation can be achieved without updating any of the model's original weights. Prompt tuning prepends a small set of learnable "soft" tokens to the input, while in-context learning (popularized by GPT-3) provides task demonstrations directly in the input prompt. These approaches enable a single pre-trained model to perform many different tasks, with at most a handful of prompt parameters trained per task and no changes to the underlying network.
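A minimal sketch of an in-context (few-shot) prompt, using the English-to-French demonstrations popularized by the GPT-3 paper; the prompt is passed to a pre-trained language model unchanged, and no weights are updated.

```python
# Minimal sketch of in-context learning: the task is specified entirely in the
# prompt through demonstrations, with no gradient updates of any kind.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
# The string would be passed as-is to a pre-trained language model,
# e.g. via model.generate() as in the GPT-2 sketch earlier in this section.
print(few_shot_prompt)
```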
In August 2021, Stanford University's Center for Research on Foundation Models (CRFM) published a report titled "On the Opportunities and Risks of Foundation Models," which introduced the term foundation model to describe pre-trained models that are trained on broad data at scale and can be adapted to a wide range of downstream tasks. Examples include BERT, GPT-3, DALL-E, and CLIP.
The report argued that while foundation models are built on conventional deep learning and transfer learning, their scale results in new emergent capabilities not present in smaller models. Their effectiveness across many tasks also incentivizes homogenization, where a few large foundation models serve as the base for a wide variety of applications. This shift has significant implications for AI research, deployment, and governance.
Foundation models now span multiple modalities:
| Domain | Example models | Pre-training data |
|---|---|---|
| Text | GPT-4, Claude, Llama, Mistral | Trillions of text tokens from the web, books, and code |
| Images | Stable Diffusion, DALL-E, Imagen | Billions of image-text pairs |
| Code | Codex, StarCoder, Code Llama | Billions of lines of source code |
| Multimodal | Gemini, GPT-4V, CLIP | Text, images, audio, and video jointly |
| Audio/Speech | Whisper, Wav2Vec | Hundreds of thousands of hours of audio |
Research into scaling laws has provided guidance on how to allocate compute budgets for pre-training. Kaplan et al. (2020) at OpenAI published an influential set of scaling laws for neural language models, observing that performance improves predictably as a power law with increases in model size, dataset size, and compute.
The Chinchilla scaling laws, published by Hoffmann et al. at DeepMind in 2022, refined these findings. By training over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens, the researchers found that the model size and the number of training tokens should be scaled equally for compute-optimal training. The optimal ratio was approximately 20 tokens per parameter.
This finding had significant practical impact. The Chinchilla model (70 billion parameters trained on 1.4 trillion tokens) outperformed much larger models like Gopher (280 billion parameters) and GPT-3 (175 billion parameters) that had been trained on fewer tokens relative to their size. The Chinchilla-optimal ratio of approximately 20 tokens per parameter became an industry benchmark, though subsequent models like Llama 3 (trained with over 200 tokens per parameter) have shown that "over-training" beyond this ratio can be worthwhile when inference cost matters more than training cost.
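A worked example of the ratio, assuming the common approximation from the scaling-law literature (not stated above) that training compute is roughly 6·N·D FLOPs for N parameters and D training tokens:

```python
# Worked example of the Chinchilla-style compute-optimal ratio.
N = 70e9                   # parameters (a Chinchilla-sized model)
tokens_per_param = 20      # compute-optimal ratio from Hoffmann et al. (2022)

D = tokens_per_param * N   # ≈ 1.4e12 tokens, matching Chinchilla's 1.4T
C = 6 * N * D              # ≈ 5.9e23 FLOPs of training compute (C ≈ 6·N·D approximation)
print(f"tokens: {D:.2e}, compute: {C:.2e} FLOPs")
```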
When a pre-trained model is fine-tuned on a new task, it can lose knowledge acquired during pre-training. This phenomenon, known as catastrophic forgetting, is a fundamental challenge in continual learning. Fine-tuning large language models on specific domains frequently degrades their performance on previously learned tasks. Techniques to mitigate this include elastic weight consolidation (EWC), progressive memory banks, rehearsal-based methods that replay examples from earlier tasks, and parameter-efficient fine-tuning methods like LoRA.
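A minimal sketch of the elastic weight consolidation penalty term; the Fisher estimates, reference (pre-trained) weights, and regularization strength are placeholders that the surrounding training code would have to supply.

```python
# Minimal sketch of the EWC penalty: quadratically anchor parameters that were
# important for earlier tasks to their pre-trained values.
import torch

def ewc_penalty(model, reference_params, fisher, lam=1000.0):
    """Return lam/2 * sum_i F_i * (theta_i - theta_i*)^2 over all named parameters."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - reference_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning, the total loss becomes: task_loss + ewc_penalty(model, ...).
```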
Pre-trained models may not transfer well when there is a large gap between the pre-training domain and the target domain. For example, a model pre-trained on natural images may perform poorly when applied to medical imaging or satellite imagery without significant adaptation. The degree of domain shift determines how much fine-tuning data and compute are needed for effective transfer.
Pre-trained models inherit biases present in their training data. Since large-scale pre-training datasets are typically collected from the internet, they reflect societal biases related to gender, race, and other protected attributes. These biases can propagate to all downstream applications that build on the pre-trained model. Addressing bias in pre-trained models requires careful dataset curation, debiasing techniques during or after training, and ongoing evaluation on fairness benchmarks.
Pre-training large models requires substantial computational resources. Training GPT-3 (175 billion parameters) was estimated to cost several million dollars and consumed significant energy. As models continue to scale, the environmental impact of pre-training has become a growing concern. This has motivated research into more efficient pre-training methods, model compression, and the sharing of pre-trained models through public repositories.
Pre-trained models can be exploited for malicious purposes, including generating misinformation, deepfakes, or harmful code. The open release of powerful pre-trained models requires balancing the benefits of open research with the risks of misuse. Some organizations have adopted staged release strategies (as OpenAI did with GPT-2) or restricted access through APIs.
Deploying large pre-trained models in production environments often requires reducing their size and computational requirements. The main compression techniques include:
- Knowledge distillation: train a smaller "student" model to reproduce the outputs of the large pre-trained "teacher" (as in DistilBERT).
- Quantization: store weights and activations in lower precision, such as 8-bit or 4-bit integers, instead of 32-bit floats.
- Pruning: remove weights, neurons, or attention heads that contribute little to the model's predictions.
- Low-rank factorization and weight sharing: approximate large weight matrices with smaller factors or reuse parameters across layers.
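A minimal sketch of one of these techniques, post-training dynamic quantization of linear layers with PyTorch; the model here is a small illustrative network rather than a real pre-trained checkpoint.

```python
# Minimal sketch: convert Linear layers to 8-bit dynamic quantization after training.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers are replaced by dynamically quantized equivalents
```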
The widespread adoption of pre-trained models has led to the creation of dedicated repositories and hubs where researchers and practitioners can discover, download, and share models.
| Repository | Maintained by | Notable features |
|---|---|---|
| Hugging Face Hub | Hugging Face | Over 1 million models; supports PyTorch, TensorFlow, JAX; model cards and community discussions |
| TensorFlow Hub | Google | Pre-trained models for the TF ecosystem; SavedModel format; easy integration with Keras |
| PyTorch Hub | Meta | Research-focused; reproducing published results; tight integration with PyTorch |
| Kaggle Models | Google (Kaggle) | Models with competition datasets; notebook integration |
| NVIDIA NGC | NVIDIA | Optimized models for NVIDIA GPUs; enterprise-grade containers |
Hugging Face has emerged as the largest and most widely used model hub. Its Transformers library, with over 84,000 GitHub stars, provides a unified API for loading and using pre-trained models across different frameworks. The platform hosts models from individual researchers, academic labs, and companies including Meta, Mistral, DeepSeek, and Stability AI.
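A minimal sketch of loading a model from the Hub through the Transformers Auto classes; the checkpoint name is an illustrative, widely used example.

```python
# Minimal sketch: download a pre-trained checkpoint and tokenizer from the
# Hugging Face Hub and run a forward pass.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pre-trained models are easy to reuse.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```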
Pre-trained models are used across virtually every area of applied AI:
- Computer vision: image classification, object detection, segmentation, and medical image analysis built on pre-trained backbones.
- Natural language processing: machine translation, summarization, question answering, and conversational assistants built on pre-trained language models.
- Speech: recognition and synthesis systems built on models such as Whisper and Wav2Vec.
- Code: completion and generation tools built on code-pretrained models such as Codex and Code Llama.
- Science and industry: domains such as protein modeling, drug discovery, recommendation, and search, where systems increasingly start from pre-trained representations rather than training from scratch.