Pre-training
Last reviewed
Apr 13, 2026
Pre-training is a machine learning paradigm in which a neural network first learns general representations from a large, usually unlabeled dataset and is then adapted to specific downstream tasks through fine-tuning. This two-stage approach is now standard across many areas of artificial intelligence because it lets models reach strong performance with far less task-specific labeled data than training from scratch.[1]
More concretely, pre-training is the initial phase in which a model is trained on a broad dataset or task to learn general patterns and representations before it is fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently unlabeled and with self-supervised learning objectives, to acquire broadly useful features and knowledge.[3]
Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on very large corpora, ranging from billions to trillions of tokens or from hundreds of millions to billions of images, and the patterns they learn transfer to a wide range of downstream applications.[3] Pre-training addresses the challenge of data scarcity: rather than requiring millions of labeled examples for each task, a model pre-trained on general data can often adapt to a new task with thousands or even dozens of examples.
History
Early foundations (2006-2012)
Pre-training's conceptual roots trace to 2006-2007, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This made it practical to train deep networks that had previously been intractable because of vanishing gradients: each layer was first pre-trained without labels, and the full network was then fine-tuned with supervision.[5]
The paradigm shifted in 2012 when AlexNet won the ImageNet challenge with a top-5 error of 15.3%, compared to 26.2% for second place, a margin of 10.9 percentage points.[6] Networks trained with labels on ImageNet then became the usual starting point for other vision tasks, which established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.
Word embeddings era (2013-2017)
Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]
Transformer revolution (2017-present)
The June 2017 paper "Attention Is All You Need" introduced the Transformer architecture.[10] It eliminated recurrence and convolution, relying entirely on multi-head self-attention, which computes relationships between all positions in a sequence in parallel.
BERT's October 2018 release demonstrated the power of transformers for language understanding through bidirectional pre-training with masked language modeling.[11] The GPT series took a unidirectional approach in parallel (the first GPT model appeared in June 2018, shortly before BERT), and GPT-3 (2020) demonstrated that scale combined with autoregressive pre-training yields emergent capabilities like few-shot learning.[12]
Core Concepts
The Two-Stage Paradigm: Pre-training and Fine-tuning
The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]
- Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
- Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on a large web corpus might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's existing parameters to specialize its knowledge for the target task.[14]
This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
Relationship with Transfer Learning
Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.
This transfer can be implemented in two primary ways:
- Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
- Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
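The practical difference between the two modes is which parameters receive gradient updates. The following is a minimal sketch in plain Python, assuming a toy model whose "backbone" is a single weight vector and whose "head" is one scalar. The names `predict`, `mse`, `train`, and the data are hypothetical and chosen only for illustration; real systems would use a deep learning framework.

```python
def predict(backbone, head, x):
    """Toy model: feature = backbone . x, prediction = head * feature."""
    return head * sum(w * xi for w, xi in zip(backbone, x))

def mse(backbone, head, data):
    """Mean squared error of the toy model on (input, target) pairs."""
    return sum((predict(backbone, head, x) - y) ** 2 for x, y in data) / len(data)

def train(backbone, head, data, freeze_backbone, lr=0.02, epochs=200):
    """Adapt the model with SGD; the head is always trained, the backbone only if unfrozen."""
    backbone = list(backbone)  # copy so the caller's pre-trained weights stay intact
    for _ in range(epochs):
        for x, y in data:
            feature = sum(w * xi for w, xi in zip(backbone, x))
            error = head * feature - y            # gradient of 0.5 * (pred - y)^2 w.r.t. pred
            grad_head = error * feature
            grad_backbone = [error * head * xi for xi in x]
            head -= lr * grad_head
            if not freeze_backbone:               # fine-tuning also moves pre-trained weights
                backbone = [w - lr * g for w, g in zip(backbone, grad_backbone)]
    return backbone, head

pretrained_backbone = [1.0, -1.0]                 # hypothetical pre-trained weights
task_data = [((1, 0), 3), ((0, 1), -1), ((1, 1), 2), ((2, 1), 5)]  # target: y = 3*x0 - x1

# Feature extraction: backbone frozen, only the head learns.
frozen_backbone, frozen_head = train(pretrained_backbone, 1.0, task_data, freeze_backbone=True)
# Fine-tuning: backbone and head are both updated.
tuned_backbone, tuned_head = train(pretrained_backbone, 1.0, task_data, freeze_backbone=False)
```

In this toy setting the frozen backbone cannot represent the target exactly, so fine-tuning reaches a lower error, while feature extraction leaves the pre-trained weights untouched and trains far fewer parameters.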
Technical Approach
Core Mechanics
Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:
- Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
- Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
- Training process:
  - Forward propagation through deep neural networks (typically transformers with 12-96 layers)
  - Calculating loss against self-supervised objectives
  - Backpropagating gradients
  - Updating billions of parameters using optimizers like Adam or AdamW
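The four training steps above can be sketched on a toy problem. This is an illustrative example only, assuming a two-parameter linear model and hand-derived gradients in place of backpropagation through a deep network. The `adam_update` helper is a hypothetical, simplified stand-in for a framework optimizer, and its hyperparameters are the commonly used Adam defaults apart from the learning rate.

```python
import math

def adam_update(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam step and return the updated parameter list."""
    state["t"] += 1
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g       # first-moment estimate
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g   # second-moment estimate
        m_hat = state["m"][i] / (1 - beta1 ** state["t"])             # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** state["t"])
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]   # toy data drawn from y = 2x + 1
params = [0.0, 0.0]                                     # [weight, bias]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
losses = []

for step in range(500):
    w, b = params
    preds = [w * x + b for x in xs]                                            # 1. forward pass
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)              # 2. loss
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)  # 3. gradients
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    params = adam_update(params, [grad_w, grad_b], state)                      # 4. update
    losses.append(loss)
```

Real pre-training follows the same loop shape, but the forward pass runs through a deep network, the loss comes from a self-supervised objective, and gradients are computed automatically across many accelerators.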
Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
Pre-training Objectives
Masked Language Modeling (MLM)
Popularized by BERT, MLM randomly selects approximately 15% of input tokens and trains the model to predict the original tokens using full bidirectional context.[11] Each selected token is handled as follows:
- 80% are replaced with [MASK]
- 10% are replaced with a random token
- 10% are left unchanged
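The masking rule can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing code: it assumes tokens are plain strings, selects each position independently with probability 0.15, and uses the hypothetical helper name `mask_tokens`.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Return (inputs, labels); labels hold the original token at selected positions, else None."""
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                       # position not selected: no prediction target
        labels[i] = token                  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
```

The loss is computed only at positions where `labels` is not `None`. Leaving some selected tokens unchanged or randomized means the model cannot rely on `[MASK]` always marking the positions it has to predict.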
Variants include:
- RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
- SpanBERT: Masks contiguous spans rather than individual tokens[19]
- ELECTRA: Discriminative pre-training detecting replaced tokens[20]
Autoregressive Language Modeling
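The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] The objective maximizes the likelihood of the training sequences, factorized by the chain rule: P(x₁, ..., xₙ) = ∏ᵢ P(xᵢ | x₁, ..., xᵢ₋₁), where the product runs over i = 1, ..., n.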
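As a worked example of this factorization, the sketch below scores a three-token sequence under a hypothetical toy bigram model, where each token depends only on the previous one. The probability tables and function names are invented for illustration; a real language model would condition on the full prefix with a neural network.

```python
import math

# Hypothetical toy model: made-up probabilities for illustration only.
START_PROBS = {"the": 0.5, "a": 0.5}
BIGRAM_PROBS = {
    "the": {"cat": 0.4, "dog": 0.6},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.5, "ran": 0.5},
    "dog": {"sat": 0.3, "ran": 0.7},
}

def cond_prob(token, prefix):
    """P(token | prefix) under the toy model, which looks only at the last prefix token."""
    if not prefix:
        return START_PROBS[token]
    return BIGRAM_PROBS[prefix[-1]][token]

def sequence_log_likelihood(tokens):
    """log P(x_1..x_n) as the sum of log P(x_i | x_1..x_{i-1})."""
    return sum(math.log(cond_prob(tok, tokens[:i])) for i, tok in enumerate(tokens))

tokens = ["the", "cat", "sat"]
log_likelihood = sequence_log_likelihood(tokens)   # log(0.5 * 0.4 * 0.5) = log(0.1)
loss_per_token = -log_likelihood / len(tokens)     # average negative log-likelihood
```

Training minimizes the average negative log-likelihood, which is the same as maximizing the product of conditional probabilities.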
Contrastive Learning
Contrastive learning became a leading self-supervised pre-training approach in computer vision and multimodal domains. Masked image modeling, listed last for comparison, is a reconstructive rather than contrastive method:
- SimCLR: Learns representations by maximizing agreement between augmented views of the same image[22]
- CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
- MAE: Masks 75% of image patches and reconstructs the missing pixels (a generative objective, not a contrastive one)[24]
| Feature | Generative Pre-training | Contrastive Pre-training |
|---|---|---|
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice |
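The InfoNCE objective named in the table can be written compactly. The sketch below is a simplified, one-directional version in plain Python, assuming embeddings are given as lists of floats and that row i of `positives` is the positive for row i of `anchors`, with all other rows acting as negatives. The function names and temperature value are illustrative assumptions. Methods such as SimCLR and CLIP use symmetric or otherwise extended variants.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: cross-entropy of picking the matching positive for each anchor."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)                                        # subtract max for stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])               # -log softmax at the positive
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]        # each positive is close to its anchor
mismatched = [[0.1, 0.9], [0.9, 0.1]]     # positives swapped between anchors

aligned_loss = info_nce(anchors, aligned)
mismatched_loss = info_nce(anchors, mismatched)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another sample's positive, which is what pulls matching pairs together in the embedding space.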
Datasets
Language Datasets
| Dataset | Size | Description | Used by |
|---|---|---|---|
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
Vision Datasets
| Dataset | Size | Description | Primary use |
|---|---|---|---|
| ImageNet-1k | 1.2M images | 1000 object classes (subset of the full 14M-image ImageNet) | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |
Computational Requirements
Training Costs
| Model | Parameters | Training time | Hardware | Estimated cost |
|---|---|---|---|---|
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days (estimated) | 1024 A100s (hypothetical configuration; the original run used V100s) | $4.6 million |
| Llama 2 (7B) | 7B | 184,320 GPU-hours | A100 (80 GB) | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | Up to 16K H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12-18 days | 256-592 V100s (largest ViT to largest ResNet variant) | $600,000 |
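The table mixes wall-clock time, device counts, and GPU-hours, so a small conversion helps when comparing rows. The sketch below is plain arithmetic on the GPT-3 row only; the helper names are hypothetical, and it assumes every accelerator is busy for the full duration. Because the time and cost figures may come from different estimates, the implied hourly rate is a consistency check rather than a real price.

```python
def gpu_hours(num_accelerators, days):
    """Total accelerator-hours for a run that keeps every device busy for `days` days."""
    return num_accelerators * days * 24

def implied_hourly_rate(total_cost_usd, num_accelerators, days):
    """Cost per accelerator-hour implied by a total cost estimate."""
    return total_cost_usd / gpu_hours(num_accelerators, days)

# GPT-3 row from the table: ~34 days on 1024 A100s, estimated at $4.6 million.
gpt3_gpu_hours = gpu_hours(1024, 34)
gpt3_rate = implied_hourly_rate(4.6e6, 1024, 34)

print(f"GPT-3: {gpt3_gpu_hours:,} GPU-hours, about ${gpt3_rate:.2f} per GPU-hour")
```

For the GPT-3 row this gives 835,584 GPU-hours and an implied rate of roughly $5.51 per GPU-hour.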
Hardware Evolution
- NVIDIA A100 (2020): 312 TFLOPS, 40/80GB memory, workhorse for 2020-2023 training[25]
- NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[26]
- Google TPU v5e (2023): A multi-pod training job spanning 50,944 chips reached about 10 exaFLOPS of peak 16-bit compute[27]
Notable Pre-Trained Models
| Model | Domain | Release Year | Parameters | Key Objective | Developer |
|---|---|---|---|---|---|
| Word2Vec | NLP | 2013 | 300-dim vectors | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook AI |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Stanford/Google |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |
Applications
Natural Language Processing
Pre-trained language models power nearly all modern NLP applications:[14][28]
- Question answering: Models achieve 89.91% F1 on SQuAD 2.0, approaching human performance[29]
- Code generation: GitHub Copilot assists with 43% of code written by developers[30]
- Conversational AI: Powering chatbots and virtual assistants like ChatGPT
- Machine translation: Near-human quality on many language pairs
- Sentiment analysis: 90%+ accuracy for review and social media monitoring
- Text summarization: Condensing documents while preserving key information
Computer Vision
Pre-trained vision models serve as backbones for diverse applications:[31]
- Image classification: Vision Transformers achieve 88.5%+ ImageNet accuracy[32]
- Object detection: 50+ box AP on COCO
- Medical imaging: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs
- Autonomous vehicles: Real-time object detection for pedestrians, vehicles, traffic signs
- Industrial automation: Quality control, safety monitoring, defect detection
Speech and Multimodal Systems
- Speech Recognition: Pre-training builds robust speech-to-text systems less sensitive to accents and noise[33]
- Text-to-image generation: Stable Diffusion uses CLIP embeddings for image synthesis[34]
- Visual question answering: Flamingo achieves state-of-the-art on 16 benchmarks[35]
- Zero-shot classification: CLIP matches the ImageNet accuracy of a supervised ResNet-50 without training on ImageNet's labeled examples
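Zero-shot classification with a CLIP-style model reduces to comparing one image embedding against the text embeddings of candidate label prompts. The sketch below shows only that comparison step, assuming the embeddings have already been produced by the two encoders. The vectors, prompts, function name, and `logit_scale` value are hypothetical placeholders, not output from a real model.

```python
import math

def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a probability per label from scaled cosine similarity followed by softmax."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    image = normalize(image_embedding)
    scores = {
        label: logit_scale * sum(a * b for a, b in zip(image, normalize(vec)))
        for label, vec in label_embeddings.items()
    }
    top = max(scores.values())                                  # subtract max for stability
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}

# Hypothetical embeddings standing in for encoder outputs.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
```

New classes can be added by writing new prompts, with no gradient updates, which is why the approach needs no task-specific training.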
Benefits and Advantages
Pre-training offers several critical advantages:[36]
- Resource efficiency: Reduces labeled data requirements by 10-100x
- Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
- Better performance: Pre-trained models consistently outperform random initialization
- Transfer learning: Knowledge transfers across related tasks and domains
- Democratization: Smaller teams can leverage frontier model capabilities
- Generalization: Models learn robust features that work across diverse applications
Challenges and Limitations
Environmental Impact
The computational demands of pre-training create significant environmental costs:
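- Carbon emissions: GPT-3 training has been estimated at roughly 552 tonnes of CO₂-equivalent; the often-quoted ~626,000 pounds of CO₂ comes from an earlier estimate for training a Transformer with neural architecture search, not GPT-3[37]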
- Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[38]
- Energy use: ChatGPT queries use ~10x more energy than Google searches[39]
Bias, Fairness, and Ethical Considerations
Pre-trained models inherit, and can amplify, societal biases present in their training data:[40]
- Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[41]
- Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[41]
- Linguistic bias: The filtering used to build C4 removed text associated with African American English at a rate of 42%, compared with 6.2% for White American English[42]
- Disability bias: Perpetuation of negative stereotypes about people with disabilities[43]
Data and Legal Issues
- Copyright concerns: Active litigation regarding training on copyrighted content
- Privacy violations: Models may memorize and reproduce personal information
- Data contamination: Benchmarks appearing in training data inflate scores
- Data quality: Web-scraped data contains misinformation, toxicity, and bias
Accessibility and Centralization
- Geographic concentration: US produces 5x more foundation models than China[44]
- Compute barriers: Training frontier models requires $10-100M+ in resources
- Hardware costs: Single H100 GPU costs $30,000-40,000
- Technical expertise: Requires specialized knowledge in distributed systems and optimization
Future Directions
Efficiency Improvements
- Mixture of Experts: Mixtral 8x7B reaches performance comparable to 70B-parameter dense models while activating about 13B parameters per token[45]
- Knowledge distillation: Creating smaller models matching larger model performance
- Quantization: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss
- Flash Attention: 2-4x speedup through optimized attention computation[46]
- ALBERT: Parameter sharing reduces model size by 18x[47]
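Of the techniques listed, quantization is the simplest to show concretely. The sketch below is a minimal example of symmetric per-tensor INT8 quantization in plain Python. It is one common scheme among several, the helper names are hypothetical, and production libraries add per-channel scales, calibration, and packed integer storage.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 if all weights are zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in 8 bits instead of 16, and the rounding error per weight is at most half of `scale`, which is why accuracy loss is often small when weight magnitudes are well behaved.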
Multimodal and Lifelong Learning
- Multimodal pre-training: Models processing text, images, audio, and video seamlessly[48]
- Continual pre-training: Updating models with new data without full retraining[49]
- Domain adaptation: Specializing models for medicine, law, science
- Lifelong learning: Overcoming catastrophic forgetting to learn continuously[50]
Beyond Pre-training: Peak Data and Alignment
Some researchers suggest the era of scaling pre-training datasets is ending:[51]
- Peak data hypothesis: Exhausting high-quality public data sources
- Post-training focus: Greater emphasis on alignment techniques like RLHF
- Constitutional AI: Self-improvement guided by explicit principles[52]
- Direct Preference Optimization: More efficient alternatives to RLHF[53]
- Agentic AI: Systems learning from environment interaction rather than static datasets
See also
- Transfer learning
- Foundation models
- Fine-tuning
- Self-supervised learning
- Transformer (machine learning model)
- BERT
- GPT-3
- Large language model
- Zero-shot learning
- Few-shot learning
- Masked language modeling
- Contrastive learning
- Word embeddings
References
External links
- Hugging Face Model Hub - Repository of pre-trained models
- BERT GitHub Repository - Original BERT implementation
- GPT-3 Applications - Examples of GPT-3 use cases
- TensorFlow Hub - Pre-trained model repository
- PyTorch Hub - Pre-trained models for PyTorch
- Common Crawl - Large-scale web crawl data
- ImageNet - Visual database for object recognition