Pre-training
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 2,548 words
Pre-training is a foundational machine learning paradigm where neural network models first learn general representations from massive unlabeled datasets before being adapted to specific downstream tasks through fine-tuning. This two-stage approach has revolutionized artificial intelligence across domains, enabling models to achieve superior performance with dramatically less task-specific labeled data than training from scratch.[1]
Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before being fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3]
Modern foundation models such as GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on very large corpora, ranging from billions to trillions of tokens of text or from hundreds of millions to billions of images, and the patterns they learn transfer across many downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, a model pre-trained on general data can adapt to a new task with thousands or even dozens of examples.
Pre-training's conceptual roots trace to 2006-2007, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough enabled training deep networks that were previously intractable due to vanishing gradient problems: each layer was pre-trained without labels, and the full network was then fine-tuned with supervision.[5]
The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet challenge with a top-5 error of 15.3% compared to 26.2% for second place, a margin of 10.9 percentage points.[6] This established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.
Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]
The June 2017 paper "Attention Is All You Need" introduced transformers, fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.
BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling.[11] The GPT series, whose first model appeared earlier in 2018, took a unidirectional (left-to-right) approach, and GPT-3 (2020) demonstrated that scale plus autoregressive pre-training yields emergent capabilities like few-shot learning.[12]
The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]
Stage 1 (pre-training): In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
Stage 2 (fine-tuning): The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on a large web corpus might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]
This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.
This transfer can be implemented in two primary ways:
Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
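The difference between the two modes comes down to which parameters the optimizer is allowed to update. The following is a minimal sketch in plain Python, not any framework's API: the `Param` class, the parameter names `backbone.w` and `head.w`, and the one-weight "model" are all hypothetical and chosen only to make the freezing logic visible.

```python
from dataclasses import dataclass

@dataclass
class Param:
    """A scalar parameter with a flag saying whether training may update it."""
    value: float
    trainable: bool = True

def make_model():
    # Hypothetical "pre-trained" backbone weight plus a freshly initialised head.
    return {"backbone.w": Param(2.0), "head.w": Param(0.0)}

def set_mode(params, mode):
    """'feature_extraction' freezes the backbone; 'fine_tuning' unfreezes everything."""
    for name, p in params.items():
        p.trainable = (mode == "fine_tuning") or name.startswith("head.")

def sgd_step(params, grads, lr):
    """Apply one SGD update, skipping frozen parameters."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]

def train(params, data, steps=50, lr=0.01):
    """Fit y ~ head.w * (backbone.w * x) by minimising squared error."""
    for _ in range(steps):
        for x, y in data:
            b, h = params["backbone.w"].value, params["head.w"].value
            feat = b * x                      # backbone output (the "feature")
            err = h * feat - y                # prediction error
            grads = {
                "head.w": 2 * err * feat,     # d(err^2)/d(head.w)
                "backbone.w": 2 * err * h * x # d(err^2)/d(backbone.w)
            }
            sgd_step(params, grads, lr)
    return params
```

Calling `set_mode(model, "feature_extraction")` before `train` leaves `backbone.w` at its pre-trained value and fits only the head, while `set_mode(model, "fine_tuning")` lets both weights move.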
Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:
Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
Training process:
Forward propagation through deep neural networks (typically transformers with 12-96 layers)
Calculating loss against self-supervised objectives
Backpropagating gradients
Updating billions of parameters using optimizers like Adam or AdamW
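The four training steps above can be illustrated at toy scale. The sketch below assumes a character-level bigram model (a table of logits, one row per previous character) and plain SGD in place of Adam or AdamW; the function name `train_bigram` and its hyperparameters are illustrative choices, not taken from any real pre-training setup.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(text, epochs=200, lr=0.5, seed=0):
    """Train logits W[prev][next] to predict the next character of `text`."""
    rng = random.Random(seed)
    vocab = sorted(set(text))
    idx = {ch: i for i, ch in enumerate(vocab)}
    size = len(vocab)
    W = [[rng.uniform(-0.01, 0.01) for _ in range(size)] for _ in range(size)]
    # Self-supervised targets: each character's label is simply the next character.
    pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for prev, nxt in pairs:
            probs = softmax(W[prev])              # 1. forward pass
            total += -math.log(probs[nxt])        # 2. cross-entropy loss
            for j in range(size):
                # 3. gradient of the loss w.r.t. logit j: p_j - 1[j == target]
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                W[prev][j] -= lr * grad           # 4. SGD parameter update
        losses.append(total / len(pairs))
    return W, vocab, losses
```

The labels come from the text itself, which is what makes the objective self-supervised; real pre-training replaces the logit table with a deep network and the hand-written gradient with automatic differentiation.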
Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
Popularized by BERT, masked language modeling (MLM) randomly selects approximately 15% of input tokens and trains the model to predict them using full bidirectional context.[11] The selected tokens are then treated as follows:
80% of selected tokens become [MASK]
10% swap to random tokens
10% remain unchanged
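The 80/10/10 rule can be sketched as a small function. This is a simplified illustration, not BERT's reference implementation: it selects each token independently with probability 0.15 (so the selected fraction is only approximately 15%), and the names `mask_tokens`, `vocab`, and `seed` are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict,
    and None at positions that do not contribute to the loss.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                         # token not selected for prediction
        labels[i] = tok                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                 # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing `[MASK]` at every position it is asked to predict, which matters because `[MASK]` never appears during fine-tuning.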
Variants include:
RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
SpanBERT: Masks contiguous spans rather than individual tokens[19]
ELECTRA: Discriminative pre-training detecting replaced tokens[20]
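The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] The objective maximizes the likelihood of the training text under the chain-rule factorization P(x₁, ..., xₙ) = ∏ᵢ P(xᵢ | x₁, ..., xᵢ₋₁), where the product runs over i = 1, ..., n.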
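In practice the product is computed as a sum of log-probabilities, one per token. The sketch below shows that computation; `toy_cond_prob` is a made-up conditional distribution over the two symbols 'a' and 'b', standing in for a real language model.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum of log P(x_i | x_1..x_{i-1}) over the sequence (chain rule)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tok, tokens[:i]))
    return total

def toy_cond_prob(token, prefix):
    """Hypothetical model over {'a', 'b'}: repeats the previous token with probability 0.9."""
    if not prefix:
        return 0.5                      # uniform over the first token
    return 0.9 if token == prefix[-1] else 0.1

# P(a, a, b) = 0.5 * 0.9 * 0.1
log_p = sequence_log_prob(["a", "a", "b"], toy_cond_prob)
```

Training minimizes the negative of this quantity, averaged over tokens, which is the usual cross-entropy loss for next-token prediction.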
Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:
SimCLR: Learns representations maximizing agreement between augmented views of same image[22]
CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
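MAE: Masks 75% of image patches and reconstructs missing pixels; this is a masked-reconstruction objective rather than a contrastive one, listed here as the main self-supervised alternative in vision[24]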
| Feature | Generative Pre-training | Contrastive Pre-training |
|---|---|---|
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice |
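The InfoNCE objective named in the table can be written in a few lines. The version below is a simplified, one-directional sketch: each anchor has one positive, and the other positives in the batch serve as negatives. SimCLR and CLIP use symmetric variants with additional details, so treat the function `info_nce` and its default temperature as illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i],
    and every other positives[j] acts as a negative."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)
        # log-sum-exp of all similarities (stable form)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denom - logits[i]   # -log softmax probability of the true pair
    return loss / len(a)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another sample's positive, which is what pulls matching pairs together in the embedding space.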
| Dataset | Size | Description | Used by |
|---|---|---|---|
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
| Dataset | Size | Description | Primary use |
|---|---|---|---|
| ImageNet | 1.2M images | 1000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |
| Model | Parameters | Training time or compute | Hardware | Estimated cost |
|---|---|---|---|---|
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days (estimate) | 1024 A100s (hypothetical configuration; the original run used V100s) | $4.6 million |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 24,576 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days | 592 V100s | $600,000 |
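The table mixes wall-clock time and GPU-hours, and the two are related by simple arithmetic. The helpers below are a rough sketch that assumes every accelerator is busy for the whole run; the hourly price passed to `rental_cost` is a placeholder for illustration, not a quoted rate.

```python
def gpu_hours(num_gpus, days):
    """Total accelerator-hours for a run that keeps num_gpus busy for `days` days."""
    return num_gpus * days * 24

def wall_clock_days(total_gpu_hours, num_gpus):
    """Days needed to spend total_gpu_hours on num_gpus accelerators in parallel."""
    return total_gpu_hours / num_gpus / 24

def rental_cost(total_gpu_hours, usd_per_gpu_hour):
    """Naive cost estimate: hours multiplied by an assumed hourly price."""
    return total_gpu_hours * usd_per_gpu_hour

# Using figures from the table above:
gpt3_hours = gpu_hours(1024, 34)                 # 34 days on 1024 GPUs
llama_days = wall_clock_days(30.84e6, 24576)     # 30.84M GPU-hours on 24,576 GPUs
example_cost = rental_cost(gpt3_hours, 2.0)      # 2.0 USD/hour is a hypothetical price
```

Real runs fall short of this ideal because of restarts, evaluation, and idle hardware, so published cost estimates can differ widely from the product of hours and list price.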
NVIDIA A100 (2020): 312 TFLOPS of dense FP16/BF16 Tensor Core throughput, 40/80GB memory, workhorse for 2020-2023 training[25]
NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[26]
Google TPU v5e (2023): A multi-pod training job spanning 50,944 chips, with roughly 10 exaFLOPS of peak 16-bit compute[27]
| Model | Domain | Release Year | Parameters | Key Objective | Developer |
|---|---|---|---|---|---|
| Word2Vec | NLP | 2013 | 300-dim embeddings | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook AI |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Google/Stanford |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |
Pre-trained language models power nearly all modern NLP applications:[14][28]
Question answering: Models achieve 89.91% F1 on SQuAD 2.0, approaching human performance[29]
Code generation: GitHub Copilot assists with 43% of code written by developers[30]
Conversational AI: Powering chatbots and virtual assistants like ChatGPT
Machine translation: Near-human quality on many language pairs
Sentiment analysis: 90%+ accuracy for review and social media monitoring
Text summarization: Condensing documents while preserving key information
Pre-trained vision models serve as backbones for diverse applications:[31]
Image classification: Vision Transformers achieve 88.5%+ ImageNet accuracy[32]
Object detection: 50+ box AP on COCO
Medical imaging: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs
Autonomous vehicles: Real-time object detection for pedestrians, vehicles, traffic signs
Industrial automation: Quality control, safety monitoring, defect detection
Speech recognition: Pre-training builds robust speech-to-text systems less sensitive to accents and noise[33]
Text-to-image generation: Stable Diffusion uses CLIP embeddings for image synthesis[34]
Visual question answering: Flamingo achieves state-of-the-art on 16 benchmarks[35]
Zero-shot classification: CLIP matches supervised models without task-specific training
Pre-training offers several critical advantages:[36]
Resource efficiency: Reduces labeled data requirements by 10-100x
Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
Better performance: Pre-trained models consistently outperform random initialization
Transfer learning: Knowledge transfers across related tasks and domains
Democratization: Smaller teams can leverage frontier model capabilities
Generalization: Models learn robust features that work across diverse applications
The computational demands of pre-training create significant environmental costs:
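Carbon emissions: GPT-3 training has been estimated at roughly 552 tonnes of CO₂-equivalent; the widely quoted figure of ~626,000 pounds of CO₂ comes from an earlier study of neural architecture search for a Transformer, not from GPT-3[37]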
Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[38]
Energy use: ChatGPT queries use ~10x more energy than Google searches[39]
Pre-trained models inherit and amplify societal biases present in training data:[40]
Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[41]
Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[41]
Linguistic bias: C4's blocklist filtering removes text in African American English at a rate of 42%, versus 6.2% for White American English[42]
Disability bias: Perpetuation of negative stereotypes about people with disabilities[43]
Copyright concerns: Active litigation regarding training on copyrighted content
Privacy violations: Models may memorize and reproduce personal information
Data contamination: Benchmarks appearing in training data inflate scores
Data quality: Web-scraped data contains misinformation, toxicity, and bias
Geographic concentration: US produces 5x more foundation models than China[44]
Compute barriers: Training frontier models requires $10-100M+ in resources
Hardware costs: Single H100 GPU costs $30,000-40,000
Technical expertise: Requires specialized knowledge in distributed systems and optimization
Mixture of Experts: Mixtral 8x7B achieves performance comparable to a 70B dense model while using about 13B active parameters per token[45]
Knowledge distillation: Creating smaller models matching larger model performance
Quantization: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss
Flash Attention: 2-4x speedup through optimized attention computation[46]
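Of the techniques above, quantization is the simplest to show concretely. The sketch below uses symmetric absmax quantization of a list of Python floats to the INT8 range; it illustrates the idea only, since production schemes typically quantize per channel or per block and start from FP16 tensors. The function names are assumptions made for the example.

```python
def quantize_int8(values):
    """Symmetric absmax quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back to floats.
    """
    scale = max(abs(v) for v in values) / 127 or 1.0   # fall back to 1.0 if all values are zero
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why accuracy loss stays small when weights are spread reasonably evenly across the range.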
Multimodal pre-training: Models processing text, images, audio, and video seamlessly[48]
Continual pre-training: Updating models with new data without full retraining[49]
Domain adaptation: Specializing models for medicine, law, science
Lifelong learning: Overcoming catastrophic forgetting to learn continuously[50]
Some researchers suggest the era of scaling pre-training datasets is ending:[51]
Peak data hypothesis: Exhausting high-quality public data sources
Post-training focus: Greater emphasis on alignment techniques like RLHF
Constitutional AI: Self-improvement guided by explicit principles[52]
Direct Preference Optimization: More efficient alternatives to RLHF[53]
Agentic AI: Systems learning from environment interaction rather than static datasets
Fine-tuning
Transformer (machine learning model)
Zero-shot learning
Masked language modeling
Contrastive learning
Word embeddings
Hugging Face Model Hub - Repository of pre-trained models
BERT GitHub Repository - Original BERT implementation
GPT-3 Applications - Examples of GPT-3 use cases
TensorFlow Hub - Pre-trained model repository
PyTorch Hub - Pre-trained models for PyTorch
Common Crawl - Large-scale web crawl data
ImageNet - Visual database for object recognition