A pre-trained model is a machine learning model that has been previously trained on a large dataset, typically for a general task, and can then be adapted to new, related tasks. Rather than building a model from scratch and initializing its weights randomly, practitioners start with a pre-trained model that has already learned useful representations from large-scale data. This approach is central to transfer learning, where knowledge gained from one task is applied to a different but related task. Pre-trained models have become the dominant paradigm in modern deep learning, powering applications across computer vision, natural language processing, speech recognition, and multimodal AI.
The practice of pre-training has transformed AI research and industry. Instead of requiring enormous labeled datasets and weeks of compute time for every new application, teams can download a pre-trained model and fine-tune it on a smaller, task-specific dataset. This reduces cost, training time, and data requirements while often improving performance compared to training from scratch.
Imagine you want to teach someone to cook Italian food. You could start by teaching them what a stove is, how to hold a knife, and what salt tastes like. Or you could find someone who already knows how to cook French food and just teach them the differences for Italian recipes. A pre-trained model is like that experienced cook: it already knows lots of general things (how ingredients combine, how heat works), so it can learn a new style of cooking much faster than starting from zero.
The concept of transferring learned knowledge between tasks dates back to Stevo Bozinovski's 1976 work on neural network training transfer. However, the modern notion of transfer learning took shape with the work of Thrun and Pratt in 1998 and was later formalized in a survey by Pan and Yang in 2009. These works established that a model trained on one domain can improve generalization on a related domain, particularly when labeled data in the target domain is scarce.
In computer vision, the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) marked a turning point. AlexNet, designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points. This result demonstrated that deep convolutional neural networks (CNNs) trained on large image datasets could learn highly transferable visual features. AlexNet's success triggered widespread adoption of ImageNet-pretrained CNNs as starting points for downstream vision tasks.
Following AlexNet, a series of deeper and more sophisticated CNN architectures were developed and pre-trained on ImageNet:
| Model | Year | Key innovation | Parameters | ILSVRC top-5 error |
|---|---|---|---|---|
| AlexNet | 2012 | Deep CNN with ReLU, dropout, GPU training | 60 million | 15.3% |
| VGGNet | 2014 | Uniform 3x3 convolutions, increased depth | 138 million (VGG-16) | 7.3% |
| GoogLeNet (Inception) | 2014 | Inception modules with parallel filter sizes | 6.8 million | 6.7% |
| ResNet | 2015 | Residual (skip) connections enabling 152+ layers | 60 million (ResNet-152) | 3.6% |
| DenseNet | 2017 | Dense connections between all layers | 20 million (DenseNet-201) | ~3.5% |
Research by Yosinski et al. (2014), published at NeurIPS, experimentally quantified the transferability of features at different layers of a deep neural network. The study found that early layers learn general features (such as edge detectors and color blobs) applicable to many tasks, while later layers become increasingly task-specific. Initializing a network with transferred features from almost any number of layers produced a boost to generalization, even after fine-tuning to a new dataset. This finding provided the empirical justification for using ImageNet-pretrained models as feature extractors or initialization points across nearly all computer vision tasks, including object detection, image segmentation, and pose estimation.
In natural language processing, pre-training began with word embeddings. Word2Vec, introduced by Tomas Mikolov et al. at Google in 2013, trained shallow neural networks on large text corpora to produce dense vector representations of words. GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, used word co-occurrence statistics to learn similar embeddings. Both Word2Vec and GloVe produced static, context-independent embeddings, meaning a word like "bank" would have the same vector regardless of whether it referred to a financial institution or a river bank.
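As a brief illustration of what "static" means here, the following sketch trains a tiny Word2Vec model with the gensim library; the library choice, toy corpus, and hyperparameters are illustrative rather than those of the original systems. The word "bank" receives a single vector shared across both contexts.

```python
# Minimal sketch: training static word embeddings with gensim's Word2Vec.
# The toy corpus and hyperparameters are illustrative placeholders.
from gensim.models import Word2Vec

sentences = [
    ["she", "deposited", "money", "at", "the", "bank"],
    ["they", "sat", "on", "the", "river", "bank"],
]

# Train a tiny skip-gram model; real models are trained on billions of tokens.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# A static embedding assigns "bank" one vector, regardless of its context.
print(model.wv["bank"].shape)  # (50,)
```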
ELMo (Embeddings from Language Models), released by the Allen Institute for AI in February 2018, addressed this limitation. ELMo used a deep bidirectional LSTM pre-trained on a corpus of one billion words to produce context-dependent word representations. Unlike Word2Vec and GloVe, ELMo generated different embeddings for the same word depending on its surrounding context. ELMo's pre-trained representations improved performance on a wide range of NLP benchmarks and marked the transition from pre-trained word vectors to pre-trained full models in NLP.
The release of BERT (Bidirectional Encoder Representations from Transformers) by Google in October 2018 and GPT (Generative Pre-trained Transformer) by OpenAI in June 2018 established the modern paradigm of large-scale pre-training on unlabeled text using transformer architectures.
BERT used two pre-training objectives: masked language modeling (MLM), where 15% of input tokens are randomly masked and the model predicts them from bidirectional context, and next sentence prediction (NSP), where the model determines whether two sentences are consecutive. BERT was pre-trained on English Wikipedia (2.5 billion words) and the BookCorpus dataset (800 million words). The BERT-Base model contained 110 million parameters, while BERT-Large contained 340 million parameters.
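A minimal sketch of the MLM objective at inference time, using the Hugging Face Transformers `fill-mask` pipeline with the public `bert-base-uncased` checkpoint (an illustrative choice); the model predicts plausible tokens for the masked position from bidirectional context.

```python
# Minimal sketch: querying BERT's masked-language-modeling head via the
# Transformers pipeline. Model name and prompt are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The goal of pre-training is to learn general [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```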
The GPT series took a different approach, using autoregressive (left-to-right) language modeling as its pre-training objective. The progression from GPT-1 to GPT-3 demonstrated the power of scaling:
| Model | Year | Parameters | Pre-training data | Pre-training objective |
|---|---|---|---|---|
| GPT-1 | 2018 | 117 million | BookCorpus | Autoregressive LM |
| GPT-2 | 2019 | 1.5 billion | WebText (40 GB) | Autoregressive LM |
| GPT-3 | 2020 | 175 billion | Common Crawl + others (570 GB) | Autoregressive LM |
| GPT-4 | 2023 | Undisclosed (rumored ~1.7 trillion) | Undisclosed | Autoregressive LM (RLHF applied after pre-training) |
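A minimal sketch of the autoregressive objective in use: the small public `gpt2` checkpoint (an illustrative choice, since GPT-3 and GPT-4 are available only through APIs) continues a prompt one token at a time via the Transformers library.

```python
# Minimal sketch: autoregressive generation with a small pre-trained GPT-2
# checkpoint. Each new token is predicted from all of the tokens before it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A pre-trained model is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```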
This era also produced a wide range of other pre-trained transformer models, including RoBERTa, ALBERT, XLNet, ELECTRA, DeBERTa, and T5.
In supervised pre-training, a model is trained on a large labeled dataset before being adapted to a target task. The classic example is training a CNN on the ImageNet classification task (1,000 object categories across 1.2 million images) and then transferring the learned features to a new task. Supervised pre-training works well when a large, high-quality labeled dataset is available for the source task. However, creating such labeled datasets is expensive, and supervised pre-training can introduce biases present in the source labels.
Unsupervised pre-training trains models to learn representations from unlabeled data by reconstructing or predicting parts of the input. Traditional approaches include autoencoders, which compress input data into a lower-dimensional representation and then reconstruct it, and clustering algorithms like k-means. Unsupervised pre-training was historically used as an initialization strategy for deep networks before techniques like batch normalization and better activation functions made training deep networks from scratch more feasible.
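A minimal PyTorch sketch of an autoencoder (the dimensions and random data are illustrative placeholders) shows the reconstruction objective described above:

```python
# Minimal sketch of an autoencoder: compress the input to a low-dimensional
# code, then reconstruct it, minimizing reconstruction error.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # a batch of flattened inputs
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
loss.backward()
```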
Self-supervised pre-training generates supervisory signals from the data itself, without requiring human-provided labels. This approach has become the dominant pre-training paradigm in both NLP and computer vision.
In NLP, self-supervised objectives include:
- Masked language modeling: randomly mask tokens and predict them from the surrounding context (BERT, RoBERTa).
- Autoregressive language modeling: predict the next token from all preceding tokens (the GPT series).
- Denoising and span corruption: corrupt spans of text and train the model to reconstruct them (T5, BART).
- Sentence-level objectives such as next sentence prediction (BERT) and sentence order prediction (ALBERT).
In computer vision, self-supervised methods include:
- Contrastive learning: pull augmented views of the same image together in embedding space while pushing other images apart (SimCLR, MoCo).
- Masked image modeling: mask image patches and train the model to reconstruct them (MAE, BEiT).
- Pretext tasks such as predicting image rotations, solving jigsaw puzzles, or colorizing grayscale images.
- Self-distillation methods that match representations across views without negative pairs (BYOL, DINO).
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in February 2021, represents a multimodal pre-training approach. CLIP jointly trains an image encoder (either a ResNet or a Vision Transformer) and a text encoder (a 12-layer transformer) on 400 million image-text pairs collected from the internet. The model learns to match images with their corresponding text descriptions using a contrastive objective. CLIP's pre-trained representations enable zero-shot classification on new image categories without any task-specific fine-tuning, simply by comparing image embeddings with text embeddings of category descriptions.
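A minimal sketch of CLIP-style zero-shot classification using the public `openai/clip-vit-base-patch32` checkpoint via the Transformers library; the image path and label set are illustrative placeholders.

```python
# Minimal sketch: zero-shot classification by comparing an image embedding
# with text embeddings of candidate label descriptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image file
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image   # image-text similarity scores
print(logits.softmax(dim=-1))               # probabilities over the label set
```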
Once a model has been pre-trained, it must be adapted to a specific downstream task. Several strategies exist, ranging from simple to sophisticated.
The simplest adaptation approach treats the pre-trained model as a fixed feature extractor. The pre-trained model's weights are frozen, and only a new output layer (or a small classifier head) is trained on the target dataset. This approach works well when the target dataset is small and the pre-training domain is similar to the target domain. In computer vision, this often means extracting features from one of the later layers of a pre-trained CNN and training a linear classifier or small neural network on top.
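A minimal sketch of this approach with a torchvision ResNet-50 pretrained on ImageNet; the target class count and optimizer settings are illustrative.

```python
# Minimal sketch: frozen pre-trained backbone + new trainable classification head.
import torch
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False              # freeze all pre-trained weights

num_target_classes = 10                      # illustrative target task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new head (trainable)

# Only the head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```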
Full fine-tuning updates all parameters of the pre-trained model on the target task's data. The model is typically initialized with the pre-trained weights and trained with a lower learning rate than would be used for training from scratch. This is the most expressive adaptation method, allowing the model to adjust all of its representations to the target task. However, full fine-tuning is computationally expensive for very large models and can lead to overfitting when the target dataset is small.
A middle ground between feature extraction and full fine-tuning, gradual unfreezing starts by training only the new output layer while keeping all pre-trained layers frozen. Layers are then unfrozen progressively from the output side toward the input, so the earlier layers, which encode the most general features, are adjusted only after the later, more task-specific layers have adapted. This technique, popularized by Howard and Ruder (2018) in ULMFiT, helps prevent catastrophic forgetting of the pre-trained representations.
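A minimal sketch of gradual unfreezing on a torchvision ResNet-50; the layer grouping and unfreezing schedule are illustrative rather than ULMFiT's exact recipe.

```python
# Minimal sketch: unfreeze layer groups progressively, from the output toward the input.
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)   # new head for an illustrative 10-class task

# Groups ordered from the output layer back toward the input.
layer_groups = [model.fc, model.layer4, model.layer3, model.layer2, model.layer1]

def unfreeze_through(stage: int) -> None:
    """Make every group up to and including `stage` trainable."""
    for group in layer_groups[: stage + 1]:
        for param in group.parameters():
            param.requires_grad = True

unfreeze_through(0)   # first: train only the new head
# ...train for a few epochs, then:
unfreeze_through(1)   # next: also adapt the last residual block, and so on
```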
As pre-trained models have grown to billions of parameters, full fine-tuning has become impractical for many users. Parameter-efficient fine-tuning methods update only a small subset of model parameters while keeping the rest frozen.
| Method | Year | Approach | Trainable parameters |
|---|---|---|---|
| Adapter tuning | 2019 | Insert small bottleneck modules between transformer layers | ~3-4% of the original model's parameters |
| Prefix tuning | 2021 | Prepend trainable vectors to each transformer layer's input | ~0.1% of the original model's parameters |
| LoRA (Low-Rank Adaptation) | 2021 | Decompose weight updates into low-rank matrices | Up to 10,000x fewer than full fine-tuning |
| QLoRA | 2023 | Combine LoRA with 4-bit quantization of the frozen base model | Same as LoRA, with further memory savings |
| DoRA | 2024 | Decompose weights into magnitude and direction components on top of LoRA | Similar to LoRA, with improved accuracy |
LoRA, proposed by Hu et al. in 2021, has become the most widely used PEFT method. Instead of updating the full weight matrix of a pre-trained model, LoRA freezes the original weights and injects trainable low-rank decomposition matrices into each layer. For GPT-3 (175 billion parameters), LoRA reduced the number of trainable parameters by 10,000 times and GPU memory requirements by three times compared to full fine-tuning with Adam. In some cases, LoRA has even outperformed full fine-tuning because it avoids catastrophic forgetting of pre-trained knowledge.
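A minimal sketch of LoRA fine-tuning with the `peft` library, applied to the small public `gpt2` checkpoint; the rank, scaling factor, and target modules shown are illustrative defaults rather than the settings from the original paper.

```python
# Minimal sketch: wrap a frozen pre-trained causal LM with trainable LoRA adapters.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2 attention projection to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the injected LoRA matrices are trainable
```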
For very large language models, adaptation can be achieved without updating any of the model's original weights. Prompt tuning prepends a small set of learnable "soft" tokens to the input, while in-context learning (popularized by GPT-3) provides task demonstrations directly in the input prompt. These approaches enable a single pre-trained model to perform many different tasks, with at most a handful of prompt parameters trained per task and no changes to the underlying network.
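A minimal sketch of an in-context (few-shot) prompt, using the English-to-French demonstrations popularized by the GPT-3 paper; the prompt is passed to a pre-trained language model unchanged, and no weights are updated.

```python
# Minimal sketch of in-context learning: the task is specified entirely in the
# prompt through demonstrations, with no gradient updates of any kind.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
# The string would be passed as-is to a pre-trained language model,
# e.g. via model.generate() as in the GPT-2 sketch earlier in this section.
print(few_shot_prompt)
```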
In August 2021, Stanford University's Center for Research on Foundation Models (CRFM) published a report titled "On the Opportunities and Risks of Foundation Models," which introduced the term foundation model to describe pre-trained models that are trained on broad data at scale and can be adapted to a wide range of downstream tasks. Examples include BERT, GPT-3, DALL-E, and CLIP.
The report argued that while foundation models are built on conventional deep learning and transfer learning, their scale results in new emergent capabilities not present in smaller models. Their effectiveness across many tasks also incentivizes homogenization, where a few large foundation models serve as the base for a wide variety of applications. This shift has significant implications for AI research, deployment, and governance.
Foundation models now span multiple modalities:
| Domain | Example models | Pre-training data |
|---|---|---|
| Text | GPT-4, Claude, Llama, Mistral | Trillions of text tokens from the web, books, and code |
| Images | Stable Diffusion, DALL-E, Imagen | Billions of image-text pairs |
| Code | Codex, StarCoder, Code Llama | Billions of lines of source code |
| Multimodal | Gemini, GPT-4V, CLIP | Text, images, audio, and video jointly |
| Audio/Speech | Whisper, Wav2Vec | Hundreds of thousands of hours of audio |
Research into scaling laws has provided guidance on how to allocate compute budgets for pre-training. Kaplan et al. (2020) at OpenAI published an influential set of scaling laws for neural language models, observing that performance improves predictably as a power law with increases in model size, dataset size, and compute.
The Chinchilla scaling laws, published by Hoffmann et al. at DeepMind in 2022, refined these findings. By training over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens, the researchers found that the model size and the number of training tokens should be scaled equally for compute-optimal training. The optimal ratio was approximately 20 tokens per parameter.
This finding had significant practical impact. The Chinchilla model (70 billion parameters trained on 1.4 trillion tokens) outperformed much larger models like Gopher (280 billion parameters) and GPT-3 (175 billion parameters) that had been trained on fewer tokens relative to their size. The Chinchilla-optimal ratio of approximately 20 tokens per parameter became an industry benchmark, though subsequent models like Llama 3 (trained with over 200 tokens per parameter) have shown that "over-training" beyond this ratio can be worthwhile when inference cost matters more than training cost.
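A worked example of the ratio, assuming the common approximation from the scaling-law literature (not stated above) that training compute is roughly 6·N·D FLOPs for N parameters and D training tokens:

```python
# Worked example of the Chinchilla-style compute-optimal ratio.
N = 70e9                   # parameters (a Chinchilla-sized model)
tokens_per_param = 20      # compute-optimal ratio from Hoffmann et al. (2022)

D = tokens_per_param * N   # ≈ 1.4e12 tokens, matching Chinchilla's 1.4T
C = 6 * N * D              # ≈ 5.9e23 FLOPs of training compute (C ≈ 6·N·D approximation)
print(f"tokens: {D:.2e}, compute: {C:.2e} FLOPs")
```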
When a pre-trained model is fine-tuned on a new task, it can lose knowledge acquired during pre-training. This phenomenon, known as catastrophic forgetting, is a fundamental challenge in continual learning. Fine-tuning large language models on specific domains frequently degrades their performance on previously learned tasks. Techniques to mitigate this include elastic weight consolidation (EWC), progressive memory banks, rehearsal-based methods that replay examples from earlier tasks, and parameter-efficient fine-tuning methods like LoRA.
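A minimal sketch of the elastic weight consolidation penalty term; the Fisher estimates, reference (pre-trained) weights, and regularization strength are placeholders that the surrounding training code would have to supply.

```python
# Minimal sketch of the EWC penalty: quadratically anchor parameters that were
# important for earlier tasks to their pre-trained values.
import torch

def ewc_penalty(model, reference_params, fisher, lam=1000.0):
    """Return lam/2 * sum_i F_i * (theta_i - theta_i*)^2 over all named parameters."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - reference_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning, the total loss becomes: task_loss + ewc_penalty(model, ...).
```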
Pre-trained models may not transfer well when there is a large gap between the pre-training domain and the target domain. For example, a model pre-trained on natural images may perform poorly when applied to medical imaging or satellite imagery without significant adaptation. The degree of domain shift determines how much fine-tuning data and compute are needed for effective transfer.
Pre-trained models inherit biases present in their training data. Since large-scale pre-training datasets are typically collected from the internet, they reflect societal biases related to gender, race, and other protected attributes. These biases can propagate to all downstream applications that build on the pre-trained model. Addressing bias in pre-trained models requires careful dataset curation, debiasing techniques during or after training, and ongoing evaluation on fairness benchmarks.
Pre-training large models requires substantial computational resources. Training GPT-3 (175 billion parameters) was estimated to cost several million dollars and consumed significant energy. As models continue to scale, the environmental impact of pre-training has become a growing concern. This has motivated research into more efficient pre-training methods, model compression, and the sharing of pre-trained models through public repositories.
Pre-trained models can be exploited for malicious purposes, including generating misinformation, deepfakes, or harmful code. The open release of powerful pre-trained models requires balancing the benefits of open research with the risks of misuse. Some organizations have adopted staged release strategies (as OpenAI did with GPT-2) or restricted access through APIs.
Deploying large pre-trained models in production environments often requires reducing their size and computational requirements. The main compression techniques include:
- Knowledge distillation: train a smaller "student" model to reproduce the outputs of the large pre-trained "teacher" (as in DistilBERT).
- Quantization: store weights and activations in lower precision, such as 8-bit or 4-bit integers, instead of 32-bit floats.
- Pruning: remove weights, neurons, or attention heads that contribute little to the model's predictions.
- Low-rank factorization and weight sharing: approximate large weight matrices with smaller factors or reuse parameters across layers.
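A minimal sketch of one of these techniques, post-training dynamic quantization of linear layers with PyTorch; the model here is a small illustrative network rather than a real pre-trained checkpoint.

```python
# Minimal sketch: convert Linear layers to 8-bit dynamic quantization after training.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers are replaced by dynamically quantized equivalents
```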
The widespread adoption of pre-trained models has led to the creation of dedicated repositories and hubs where researchers and practitioners can discover, download, and share models.
| Repository | Maintained by | Notable features |
|---|---|---|
| Hugging Face Hub | Hugging Face | Over 1 million models; supports PyTorch, TensorFlow, JAX; model cards and community discussions |
| TensorFlow Hub | Google | Pre-trained models for the TF ecosystem; SavedModel format; easy integration with Keras |
| PyTorch Hub | Meta | Research-focused; reproducing published results; tight integration with PyTorch |
| Kaggle Models | Google (Kaggle) | Models with competition datasets; notebook integration |
| NVIDIA NGC | NVIDIA | Optimized models for NVIDIA GPUs; enterprise-grade containers |
Hugging Face has emerged as the largest and most widely used model hub. Its Transformers library, with over 84,000 GitHub stars, provides a unified API for loading and using pre-trained models across different frameworks. The platform hosts models from individual researchers, academic labs, and companies including Meta, Mistral, DeepSeek, and Stability AI.
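A minimal sketch of loading a model from the Hub through the Transformers Auto classes; the checkpoint name is an illustrative, widely used example.

```python
# Minimal sketch: download a pre-trained checkpoint and tokenizer from the
# Hugging Face Hub and run a forward pass.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pre-trained models are easy to reuse.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```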
Pre-trained models are used across virtually every area of applied AI:
- Computer vision: image classification, object detection, segmentation, and medical image analysis built on pre-trained backbones.
- Natural language processing: machine translation, summarization, question answering, and conversational assistants built on pre-trained language models.
- Speech: recognition and synthesis systems built on models such as Whisper and Wav2Vec.
- Code: completion and generation tools built on code-pretrained models such as Codex and Code Llama.
- Science and industry: domains such as protein modeling, drug discovery, recommendation, and search, where systems increasingly start from pre-trained representations rather than training from scratch.