Pre-training
Last reviewed
Sources
54 citations
Review status
Source-backed
Revision
v5 · 3,755 words
Pre-training is the first and most compute-intensive stage of building a modern AI model: a neural network is trained on a massive, mostly unlabeled dataset using self-supervised learning to learn general-purpose representations, before it is later adapted to specific tasks through fine-tuning.[1] The pre-trained result is a versatile foundation model, which the team that coined the term defined as "any model that is trained on broad data that can be adapted to a wide range of downstream tasks."[2] This two-stage approach has reshaped artificial intelligence across language, vision, and multimodal domains, letting models reach high accuracy with far less task-specific labeled data than training from scratch requires.[1]
Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on billions to trillions of tokens or hundreds of millions of images, learning patterns that transfer across a wide range of downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, a model pre-trained on general data can adapt to new tasks with a few thousand or even a few dozen examples.
Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before it is fine-tuned on a specific problem.[1] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3] The objective is typically self-supervised, such as predicting the next word in a sentence or filling in masked parts of an image, so no human labels are needed and the data itself supplies the supervision signal.
Pre-training's conceptual roots trace to 2006, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh published "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation, introducing greedy, layer-wise unsupervised pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough enabled training deep networks that were previously intractable due to vanishing gradient problems: each layer was pre-trained unsupervised, then the full network was fine-tuned with supervision.[5]
The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet Large Scale Visual Recognition Challenge with a top-5 error of 15.3%, compared to 26.2% for the second-place entry, a margin of 10.9 percentage points.[6] This established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.
Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]
The paper "Attention Is All You Need," posted to arXiv on June 12, 2017, introduced transformers, fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.
BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling. As Devlin et al. wrote, "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers."[11] The GPT series, whose first model appeared in June 2018, took a parallel path with a unidirectional, autoregressive approach. With GPT-3 (2020), OpenAI introduced a 175-billion-parameter model, which the paper described as "10x more than any previous non-sparse language model," demonstrating that scale plus autoregressive pre-training yields emergent capabilities like few-shot learning.[12]
The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]
Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on a large web-scale text corpus might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]
This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.
This transfer can be implemented in two primary ways:
Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
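The difference between the two modes comes down to which parameters receive gradient updates. The following is a minimal sketch on a toy two-parameter model; the parameter names `backbone` and `head`, the numbers, and the `sgd_step` helper are all hypothetical and stand in for a real network and optimizer.

```python
# Toy model: y_hat = head * (backbone * x).
# "backbone" plays the role of pre-trained weights, "head" the new task layer.
# All names and values are illustrative, not taken from any library.

def sgd_step(params, trainable, x, y, lr=0.1):
    """One gradient step on squared error, updating only the trainable parameters."""
    feature = params["backbone"] * x
    y_hat = params["head"] * feature
    err = y_hat - y
    grads = {
        "backbone": 2 * err * params["head"] * x,  # dL/d(backbone)
        "head": 2 * err * feature,                 # dL/d(head)
    }
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

pretrained = {"backbone": 1.5, "head": 0.5}

# Feature extraction: the backbone is frozen, only the head is trained.
feature_extraction = sgd_step(pretrained, trainable={"head"}, x=1.0, y=3.0)

# Fine-tuning: every parameter is updated.
fine_tuned = sgd_step(pretrained, trainable={"backbone", "head"}, x=1.0, y=3.0)
```

In the first call the backbone value is unchanged after the step, while in the second both values move toward the target.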
Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:
Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
Training process:
Forward propagation through deep neural networks (typically transformers with 12-96 layers)
Calculating loss against self-supervised objectives
Backpropagating gradients
Updating billions of parameters using optimizers like Adam or AdamW
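The four training steps listed above can be sketched at toy scale. The example below assumes a bigram next-token model (one row of logits per previous token) and plain SGD rather than Adam or AdamW; the corpus, learning rate, and function names are illustrative only, and a real run would use a deep transformer and a distributed training framework.

```python
import math

# Toy corpus and vocabulary (illustrative only).
tokens = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Parameters: one row of logits per previous token (a bigram model).
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def epoch(lr=0.5):
    """One pass over the corpus; returns the mean next-token loss."""
    total_loss = 0.0
    pairs = list(zip(tokens, tokens[1:]))
    for prev, nxt in pairs:
        p, t = idx[prev], idx[nxt]
        probs = softmax(logits[p])            # forward pass
        total_loss += -math.log(probs[t])     # self-supervised cross-entropy loss
        for j in range(V):
            # Gradient of the loss with respect to logit j (softmax + cross-entropy).
            grad = probs[j] - (1.0 if j == t else 0.0)
            logits[p][j] -= lr * grad         # plain SGD parameter update
    return total_loss / len(pairs)

losses = [epoch() for _ in range(20)]
```

The labels come from the text itself (each token is the target for its predecessor), which is what makes the objective self-supervised.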
Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
Popularized by BERT, masked language modeling (MLM) randomly selects approximately 15% of input tokens and trains the model to predict the originals using full bidirectional context.[11] The selected tokens are corrupted as follows:
80% of selected tokens become [MASK]
10% swap to random tokens
10% remain unchanged
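The 80/10/10 rule above can be sketched as a small function. This is a simplified illustration, not BERT's actual data pipeline: it works on whole-word strings instead of WordPiece IDs, selects each token independently with probability 0.15 rather than choosing a fixed number per sequence, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: token passes through and contributes no loss.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                        # model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave unchanged
    return corrupted, labels

sentence = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(sentence, sorted(set(sentence)), rng=random.Random(0))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing `[MASK]` at every position it must predict, which narrows the gap between pre-training inputs and fine-tuning inputs.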
Variants include:
RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
SpanBERT: Masks contiguous spans rather than individual tokens[19]
ELECTRA: Discriminative pre-training detecting replaced tokens[20]
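The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] The objective maximizes the likelihood of each training sequence under the chain-rule factorization P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁), which in practice means minimizing the summed negative log-probability of each next token.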
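As a worked illustration of this factorization, the sketch below scores a short sequence with a hand-written table of conditional probabilities. The table `COND` is hypothetical, and for brevity it conditions only on the previous token, whereas a GPT-style model conditions on the entire prefix.

```python
import math

# Hypothetical toy model: P(next token | previous token). "<s>" marks sequence start.
COND = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.75},
    "b": {"a": 0.5, "b": 0.5},
}

def log_likelihood(seq):
    """Sum of log-probabilities of each token given what came before it."""
    total, prev = 0.0, "<s>"
    for tok in seq:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

seq = ["a", "b", "a"]
ll = log_likelihood(seq)
prob = math.exp(ll)               # 0.5 * 0.75 * 0.5 = 0.1875
nll_per_token = -ll / len(seq)    # the average loss minimized during pre-training
```

Working in log space turns the product into a sum, which avoids numerical underflow on long sequences.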
Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:
SimCLR: Learns representations maximizing agreement between augmented views of same image[22]
CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
MAE: Masks 75% of image patches and reconstructs missing pixels[24]
| Feature | Generative Pre-training | Contrastive Pre-training |
|---|---|---|
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice |
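The InfoNCE objective named in the table can be sketched in a few lines. This is a simplified, one-directional version under stated assumptions: embeddings are plain Python lists, `positives[i]` is the matching view (or caption) for `anchors[i]`, every other row in the batch serves as a negative, and the temperature of 0.1 is an arbitrary illustrative value. SimCLR's NT-Xent loss and CLIP's symmetric loss differ in details not shown here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: cross-entropy of picking the true match among all candidates."""
    losses = []
    for i, a in enumerate(anchors):
        sims = [cosine(a, p) / temperature for p in positives]
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))  # stable log-sum-exp
        losses.append(log_denom - sims[i])
    return sum(losses) / len(losses)

views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each row is close to its own anchor
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # each row is close to the wrong anchor

loss_aligned = info_nce(views_a, aligned)
loss_shuffled = info_nce(views_a, shuffled)
```

The loss is small when each anchor is most similar to its own positive and large when the pairing is wrong, which is the pressure that pulls matching samples together in the embedding space.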
| Dataset | Size | Description | Used by |
|---|---|---|---|
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
| Dataset | Size | Description | Primary use |
|---|---|---|---|
| ImageNet (ILSVRC-1k subset) | 1.2M images | 1,000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |
Pre-training costs span several orders of magnitude, from a few hundred dollars for a small encoder to tens of millions of dollars for the largest frontier models, which rank among the most expensive computations in industry. Lambda Labs estimated that pre-training GPT-3 required roughly 3.14 x 10^23 floating-point operations, which at $1.5 per GPU-hour on a V100 server would cost about $4.6 million.[12] Meta reported that pre-training Llama 3.1 405B on more than 15 trillion tokens took 30.84 million GPU-hours on a cluster of 16,384 NVIDIA H100 80GB GPUs.[25]
| Model | Parameters | Training time | Hardware | Estimated cost |
|---|---|---|---|---|
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days (estimated) | 1,024 A100s (hypothetical; the original run used V100s) | $4.6 million (V100 cloud estimate) |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 16,384 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP (ViT-L/14) | 400M | 12 days | 256 V100s | $600,000 |
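The GPT-3 and Llama 3.1 figures quoted above can be reproduced with back-of-the-envelope arithmetic. The FLOP count, hourly price, GPU-hours, and GPU count come from the text; the sustained V100 throughput of 28 TFLOPS is an assumption introduced here so that the numbers can be connected, and the wall-clock figure assumes every GPU ran continuously.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365

# GPT-3 estimate.
gpt3_flops = 3.14e23             # from the text
v100_flops_per_sec = 28e12       # ASSUMPTION: not stated in the article
price_per_gpu_hour = 1.5         # USD, from the text

gpt3_gpu_hours = gpt3_flops / v100_flops_per_sec / SECONDS_PER_HOUR
gpt3_gpu_years = gpt3_gpu_hours / HOURS_PER_YEAR
gpt3_cost_usd = gpt3_gpu_hours * price_per_gpu_hour

# Llama 3.1 405B: convert GPU-hours to wall-clock days on the reported cluster.
llama_gpu_hours = 30.84e6
llama_gpus = 16_384
llama_days = llama_gpu_hours / llama_gpus / 24

print(f"GPT-3: {gpt3_gpu_years:.0f} GPU-years, about ${gpt3_cost_usd / 1e6:.1f}M")
print(f"Llama 3.1 405B: about {llama_days:.0f} days of wall-clock time")
```

Under these assumptions the GPT-3 estimate lands near the $4.6 million quoted above, and the Llama 3.1 run corresponds to roughly eleven weeks on the full cluster.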
NVIDIA A100 (2020): 312 TFLOPS of dense FP16/BF16 Tensor Core throughput, 40/80GB memory, workhorse for 2020-2023 training[26]
NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[27]
Google TPU v5e (2023): Multi-pod training runs spanning 50,944 chips, with roughly 10 exaFLOPS of aggregate peak compute[28]
| Model | Domain | Release Year | Parameters | Key Objective | Developer |
|---|---|---|---|---|---|
| Word2Vec | NLP | 2013 | 300-dim vectors | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook AI |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Google/Stanford |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |
Pre-trained language models power nearly all modern NLP applications:[14][29]
Question answering: Models achieve 89.91% F1 on SQuAD 2.0, approaching human performance[30]
Code generation: Microsoft CEO Satya Nadella stated in March 2023 that GitHub Copilot was writing 46% of code in files where it is enabled[31]
Conversational AI: Powering chatbots and virtual assistants like ChatGPT
Machine translation: Near-human quality on many language pairs
Sentiment analysis: 90%+ accuracy for review and social media monitoring
Text summarization: Condensing documents while preserving key information
Pre-trained vision models serve as backbones for diverse applications:[32]
Image classification: Vision Transformers achieve 88.5%+ ImageNet accuracy[33]
Object detection: 50+ box AP on COCO
Medical imaging: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs
Autonomous vehicles: Real-time object detection for pedestrians, vehicles, traffic signs
Industrial automation: Quality control, safety monitoring, defect detection
Speech recognition: Pre-training builds robust speech-to-text systems less sensitive to accents and noise[34]
Text-to-image generation: Stable Diffusion uses CLIP embeddings for image synthesis[35]
Visual question answering: Flamingo achieves state-of-the-art on 16 benchmarks[36]
Zero-shot classification: CLIP matches supervised models without task-specific training
Pre-training offers several critical advantages:[37]
Resource efficiency: Reduces labeled data requirements by 10-100x
Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
Better performance: Pre-trained models consistently outperform random initialization
Transfer learning: Knowledge transfers across related tasks and domains
Democratization: Smaller teams can leverage frontier model capabilities
Generalization: Models learn robust features that work across diverse applications
The computational demands of pre-training create significant environmental costs:
Carbon emissions: GPT-3 training produced an estimated 552 metric tons of CO2 (about 1.2 million pounds), per a 2021 Google and UC Berkeley analysis[38]
Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[39]
Energy use: ChatGPT queries have been estimated to use roughly 10x more energy than a Google search[40]
Pre-trained models inherit and amplify societal biases present in training data:[41]
Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[42]
Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[42]
Linguistic bias: The blocklist filter used to build C4 removed text in African American English at a rate of 42%, versus 6.2% for White American English[43]
Disability bias: Perpetuation of negative stereotypes about people with disabilities[44]
Copyright concerns: Active litigation regarding training on copyrighted content
Privacy violations: Models may memorize and reproduce personal information
Data contamination: Benchmarks appearing in training data inflate scores
Data quality: Web-scraped data contains misinformation, toxicity, and bias
Geographic concentration: The 2025 Stanford AI Index reported that U.S.-based institutions produced 40 notable AI models in 2024, far more than China's 15[45]
Compute barriers: Training frontier models requires $10-100M+ in resources
Hardware costs: A single H100 GPU has carried list and street prices in the $25,000-40,000 range
Technical expertise: Requires specialized knowledge in distributed systems and optimization
Mixture of Experts: Mixtral 8x7B matches or exceeds Llama 2 70B on most benchmarks while activating only about 13B of its roughly 47B parameters per token[46]
Knowledge distillation: Creating smaller models matching larger model performance
Quantization: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss
Flash Attention: 2-4x speedup through optimized attention computation[47]
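To make the quantization item above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is one simple scheme among many, shown on a four-element list; production methods typically use per-channel or per-group scales and calibration data, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers and scale."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.5, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is bounded by half the scale, which is why accuracy loss is often small when the weight distribution has few outliers.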
Multimodal pre-training: Models processing text, images, audio, and video seamlessly[49]
Continual pre-training: Updating models with new data without full retraining[50]
Domain adaptation: Specializing models for medicine, law, science
Lifelong learning: Overcoming catastrophic forgetting to learn continuously[51]
Some researchers suggest the era of scaling pre-training datasets is ending:[52]
Peak data hypothesis: Exhausting high-quality public data sources
Post-training focus: Greater emphasis on alignment techniques like RLHF
Constitutional AI: Self-improvement guided by explicit principles[53]
Direct Preference Optimization: More efficient alternatives to RLHF[54]
Agentic AI: Systems learning from environment interaction rather than static datasets
Fine-tuning
Transformer (machine learning model)
Zero-shot learning
Masked language modeling
Contrastive learning
Word embeddings
Hugging Face Model Hub - Repository of pre-trained models
BERT GitHub Repository - Original BERT implementation
GPT-3 Applications - Examples of GPT-3 use cases
TensorFlow Hub - Pre-trained model repository
PyTorch Hub - Pre-trained models for PyTorch
Common Crawl - Large-scale web crawl data
ImageNet - Visual database for object recognition