Pre-training is a foundational machine learning paradigm where neural network models first learn general representations from massive unlabeled datasets before being adapted to specific downstream tasks through fine-tuning. This two-stage approach has revolutionized artificial intelligence across domains, enabling models to achieve superior performance with dramatically less task-specific labeled data than training from scratch.[1]
Pre-training is the initial phase in which a model is trained on a broad dataset or task to learn general patterns and representations before it is fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3]
Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on trillions of tokens or billions of images, learning rich patterns that transfer effectively across a wide range of downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, models pre-trained on general data can adapt to new tasks with mere thousands or even dozens of examples.
Pre-training's conceptual roots trace to 2006-2007 when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough enabled training deep networks that were previously intractable due to vanishing gradient problems—each layer pre-trained unsupervised, then the full network fine-tuned with supervision.[5]
The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet challenge with a top-5 error of 15.3% compared to 26.2% for the runner-up, a then-unprecedented margin of 10.9 percentage points.[6] This established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.
Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]
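The Skip-gram objective mentioned above can be made concrete with a small sketch: for each center word, the model is trained to predict the words inside a surrounding context window. The pair-extraction step shown here is illustrative (the function name and toy sentence are invented for the example, not Word2Vec's actual code):

```python
# Illustrative sketch of Skip-gram training-pair extraction: each center
# word is paired with every word inside a symmetric context window.
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for a Skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=1)
# With window=1, each interior word contributes two pairs,
# each endpoint one pair.
```

The embedding matrix is then optimized so that a center word's vector scores its observed context words highly, which is what gives the learned vectors their semantic structure.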
The June 2017 paper "Attention Is All You Need" introduced the transformer architecture, which has since become the foundation of most large pre-trained models.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.
BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling.[11] GPT paralleled BERT with a unidirectional approach, with GPT-3 (2020) demonstrating that scale plus autoregressive pre-training yields emergent capabilities like few-shot learning.[12]
The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]
This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.
This transfer can be implemented in two primary ways: feature extraction, in which the pre-trained model's weights are frozen and its outputs are used as fixed input features for a new task-specific head, and fine-tuning, in which some or all of the pre-trained weights are further updated on the downstream task.
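As a toy sketch (not any specific library's API), the two common transfer modes, feature extraction (frozen backbone, new head) and fine-tuning (backbone updated as well), can be contrasted on a synthetic regression task:

```python
import numpy as np

# Toy contrast of the two transfer modes. All names and data here are
# illustrative: W_pre stands in for pre-trained backbone weights, and the
# downstream task is a synthetic regression built on those features.
rng = np.random.default_rng(0)
W_pre = rng.normal(size=(8, 4))                  # "pre-trained" backbone
X = rng.normal(size=(32, 8))                     # downstream inputs
y = X @ W_pre @ np.array([1.0, -1.0, 0.5, 0.0])  # downstream targets

def transfer(freeze_backbone, steps=1500, lr=0.02):
    Wb, w_head = W_pre.copy(), np.zeros(4)
    for _ in range(steps):
        feats = X @ Wb                             # backbone features
        err = feats @ w_head - y
        w_head -= lr * feats.T @ err / len(X)      # the new head always trains
        if not freeze_backbone:                    # fine-tuning only:
            Wb -= lr * np.outer(X.T @ err / len(X), w_head)
    return np.mean((X @ Wb @ w_head - y) ** 2)

frozen_loss = transfer(freeze_backbone=True)      # feature extraction
finetuned_loss = transfer(freeze_backbone=False)  # full fine-tuning
# both modes reduce the error relative to an untrained head
```

Feature extraction is cheaper and less prone to overfitting on small downstream datasets; fine-tuning typically reaches higher accuracy when enough task data is available.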
Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale.
Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
Popularized by BERT, MLM randomly selects approximately 15% of input tokens and trains the model to predict them using full bidirectional context.[11] Of the selected tokens, 80% are replaced with the special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged, which discourages the model from relying on a [MASK] symbol that never appears during fine-tuning.
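A minimal sketch of this corruption step (the 15% rate and 80/10/10 split follow the BERT paper; the function name and toy vocabulary are illustrative):

```python
import random

# Sketch of BERT-style MLM input corruption. Positions selected for
# prediction get a label; everything else is left for the model to read.
def mask_for_mlm(tokens, vocab, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:       # select ~15% of positions
            labels[i] = tok                # model must recover the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"           # 80%: replace with mask token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with random token
            # remaining 10%: keep the token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
inputs, labels = mask_for_mlm(vocab * 20, vocab)
```

The loss is computed only at the labeled positions, so the model cannot solve the task by copying its input.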
Variants include RoBERTa's dynamic masking, which re-samples the mask pattern on every pass over the data; SpanBERT's masking of contiguous spans rather than individual tokens; and ELECTRA's replaced token detection, which trains a discriminator to identify tokens substituted by a small generator.
The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] The objective maximizes likelihood: P(x₁, ..., xₙ) = ∏P(xᵢ|x₁, ..., xᵢ₋₁)
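The factorization can be illustrated with a toy bigram model (the probability table below is invented for the example; real models condition on the full prefix, not just the previous token):

```python
import math

# Toy illustration of the autoregressive factorization: the joint
# probability of a sequence is the product of next-token conditionals.
cond = {                     # P(next | previous); values are made up
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sequence_logprob(tokens, cond):
    prev, logp = "<s>", 0.0
    for tok in tokens:
        logp += math.log(cond[(prev, tok)])  # accumulate log P(token | prefix)
        prev = tok
    return logp

logp = sequence_logprob(["the", "cat", "sat"], cond)
# exp(logp) == 0.5 * 0.2 * 0.4 == 0.04
```

Training maximizes exactly this log-likelihood over the corpus, which is why generation proceeds one token at a time, left to right.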
Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:
| Feature | Generative Pre-training | Contrastive Pre-training |
|---|---|---|
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice |
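A minimal NumPy sketch of an InfoNCE-style contrastive objective of the kind used by SimCLR and CLIP (the function and data here are illustrative, not any library's implementation): matched pairs sit on the diagonal of the batch similarity matrix and act as positives, while every other pairing in the batch serves as a negative.

```python
import numpy as np

# InfoNCE-style contrastive loss: pull matched (diagonal) pairs together,
# push all other in-batch pairings apart.
def info_nce(z_a, z_b, temperature=0.1):
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
random_pairs = info_nce(z, rng.normal(size=(8, 16)))        # unrelated views
# the loss is much lower when the two views of each sample actually match
```

The temperature controls how sharply the softmax concentrates on the hardest negatives; very small values make training sensitive to noisy pairs.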
| Dataset | Size | Description | Used by |
|---|---|---|---|
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
| Dataset | Size | Description | Primary use |
|---|---|---|---|
| ImageNet | 1.2M images | 1000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |
| Model | Parameters | Training time | Hardware | Estimated cost |
|---|---|---|---|---|
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days | 1024 A100s | $4.6 million |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 24,576 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days | 592 V100s | $600,000 |
| Model | Domain | Release Year | Parameters | Key Objective | Developer |
|---|---|---|---|---|---|
| Word2Vec | NLP | 2013 | 300 dim | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Meta |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Google |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |
Pre-trained language models power nearly all modern NLP applications:[14][28]
Pre-trained vision models serve as backbones for diverse applications:[31]
Pre-training offers several critical advantages:[36]
The computational demands of pre-training create significant environmental costs:
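A rough sense of scale can be recovered from reported training compute. Taking the Llama 3.1 GPU-hour figure from the table above, and treating the per-GPU power draw, datacenter overhead (PUE), and grid carbon intensity below as illustrative assumptions rather than reported numbers:

```python
# Back-of-envelope energy estimate from reported GPU-hours.
# Only gpu_hours comes from a published figure; the other
# three constants are illustrative assumptions.
gpu_hours = 30.84e6        # Llama 3.1 405B (reported)
watts_per_gpu = 700        # assumed H100 board power
pue = 1.1                  # assumed datacenter power overhead
kg_co2_per_kwh = 0.4       # assumed grid carbon intensity

energy_kwh = gpu_hours * watts_per_gpu / 1000 * pue
tonnes_co2 = energy_kwh * kg_co2_per_kwh / 1000
# roughly 23.7 GWh and ~9,500 tonnes of CO2 under these assumptions
```

Actual emissions depend heavily on the datacenter's energy mix; providers using renewable power or purchasing offsets report substantially lower net figures.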
Pre-trained models inherit and amplify societal biases present in training data:[40]
Some researchers suggest the era of scaling pre-training datasets is ending:[51]