Pre-training

Pre-training is a foundational machine learning paradigm where neural network models first learn general representations from massive unlabeled datasets before being adapted to specific downstream tasks through fine-tuning. This two-stage approach has revolutionized artificial intelligence across domains, enabling models to achieve superior performance with dramatically less task-specific labeled data than training from scratch.[1]

Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before being fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3]

Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on trillions of tokens or billions of images, learning rich patterns that transfer effectively across countless downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, models pre-trained on general data can adapt to new tasks with mere thousands or even dozens of examples.

History

Early foundations (2006-2012)

Pre-training's conceptual roots trace to 2006-2007, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough made it practical to train deep networks that had previously been intractable because of vanishing gradients: each layer was pre-trained without labels, and the full network was then fine-tuned with supervision.[5]

The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet challenge with a top-5 error of 15.3%, compared to 26.2% for the runner-up, a margin of 10.9 percentage points.[6] This result established supervised large-scale pre-training as the dominant transfer-learning approach in computer vision.

Word embeddings era (2013-2018)

Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]

Transformer revolution (2017-present)

The June 2017 paper "Attention Is All You Need" introduced transformers, fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.

BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training with masked language modeling.[11] The GPT series, whose first version appeared a few months before BERT, took the opposite, unidirectional approach of autoregressive next-token prediction; GPT-3 (2020) showed that this objective combined with massive scale yields emergent capabilities such as few-shot learning.[12]

Core Concepts

The Two-Stage Paradigm: Pre-training and Fine-tuning

The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]

  1. Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
  2. Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on the entire internet might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]

This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
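
The snippet below is a minimal sketch of stage 2, assuming PyTorch and the Hugging Face transformers library; the "bert-base-uncased" checkpoint, the two example sentences, and the hyperparameters are purely illustrative. Stage 1 is assumed to have been done already by the checkpoint's publisher.

```python
# Sketch of the two-stage paradigm: load a pre-trained checkpoint (Stage 1 already done),
# then fine-tune it on a tiny labeled sentiment dataset (Stage 2).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "bert-base-uncased" was pre-trained with masked language modeling on large text corpora.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny labeled dataset for the downstream task (illustrative only).
texts = ["great product, works perfectly", "arrived broken and late"]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy data; real fine-tuning uses proper data loaders
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```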

Relationship with Transfer Learning

Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.

This transfer can be implemented in two primary ways:

  • Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
  • Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
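
The two options above can be contrasted in a short PyTorch/torchvision sketch, assuming torchvision 0.13+ for the pre-trained weights API; the class count and the choice of ResNet-50 are illustrative.

```python
# Feature extraction vs. fine-tuning with an ImageNet-pre-trained ResNet-50 backbone.
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

num_classes = 10  # illustrative downstream task

# Feature extraction: freeze every pre-trained parameter; only the new head is trained.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new layer is trainable by default

# Fine-tuning: replace the head but leave the backbone parameters trainable as well,
# typically with a smaller learning rate for the pre-trained layers.
finetune_model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
finetune_model.fc = nn.Linear(finetune_model.fc.in_features, num_classes)
```

Feature extraction is often preferred when the downstream dataset is small, while full fine-tuning tends to help when more labeled data is available.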

Technical Approach

Core Mechanics

Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:

  1. Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
  2. Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
  3. Training process:
    • Forward propagation through deep neural networks (typically transformers with 12-96 layers)
    • Calculating loss against self-supervised objectives
    • Backpropagating gradients
    • Updating billions of parameters using optimizers like Adam or AdamW

Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
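
The loop below is a schematic of these steps at toy scale, assuming PyTorch; the tiny two-layer model, the random token batches, and the hyperparameters stand in for a real corpus and a multi-billion-parameter transformer.

```python
# Schematic pre-training loop: forward pass, self-supervised (next-token) loss,
# backpropagation, and an AdamW parameter update. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch_size = 1000, 64, 32, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq = tokens.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.head(h)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):                                      # real runs stream trillions of tokens
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for a text batch
    logits = model(tokens[:, :-1])                            # forward propagation
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),    # self-supervised objective:
                           tokens[:, 1:].reshape(-1))         # predict the next token
    loss.backward()                                           # backpropagate gradients
    optimizer.step()                                          # update parameters
    optimizer.zero_grad()
```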

Pre-training Objectives

Masked Language Modeling (MLM)

Popularized by BERT, MLM randomly masks approximately 15% of input tokens and trains models to predict them using full bidirectional context.[11] The masking strategy:

  • 80% of selected tokens become [MASK]
  • 10% swap to random tokens
  • 10% remain unchanged
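
A minimal sketch of this 80/10/10 scheme, assuming PyTorch; the token IDs, vocabulary size, and the mask_tokens helper are illustrative, and details such as special-token handling and whole-word masking are omitted.

```python
# BERT-style MLM masking: ~15% of positions become prediction targets,
# split 80% [MASK] / 10% random token / 10% unchanged.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob  # choose ~15% of positions
    labels[~selected] = -100                           # ignore unselected positions in the loss

    masked = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    masked[selected & (rand < 0.8)] = mask_token_id    # 80% of selected tokens -> [MASK]
    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    swap = selected & (rand >= 0.8) & (rand < 0.9)     # 10% -> a random token
    masked[swap] = random_tokens[swap]
    return masked, labels                              # remaining 10% stay unchanged

# Usage with illustrative IDs:
ids = torch.randint(5, 1000, (2, 16))
inputs, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
```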

Variants include:

  • RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
  • SpanBERT: Masks contiguous spans rather than individual tokens[19]
  • ELECTRA: Discriminative pre-training detecting replaced tokens[20]

Autoregressive Language Modeling

The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right order.[21] The training objective maximizes the factorized likelihood P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁)
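
The snippet below illustrates this factorization numerically, assuming PyTorch: the sequence log-likelihood is the sum of per-token conditional log-probabilities, which equals the negated cross-entropy loss used in training. The random logits stand in for a causal language model's outputs.

```python
# Relate the product-of-conditionals likelihood to the cross-entropy training loss.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 8
tokens = torch.randint(0, vocab_size, (seq_len,))
logits = torch.randn(seq_len - 1, vocab_size)        # stand-in model outputs for positions 2..n

log_probs = F.log_softmax(logits, dim=-1)
# log P(x_2, ..., x_n | x_1) = sum_i log P(x_i | x_1, ..., x_{i-1})
sequence_log_prob = log_probs[torch.arange(seq_len - 1), tokens[1:]].sum()
nll = F.cross_entropy(logits, tokens[1:], reduction="sum")  # equals -sequence_log_prob
```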

Contrastive Learning

Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:

  • SimCLR: Learns representations maximizing agreement between augmented views of same image[22]
  • CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
  • MAE: Masks 75% of image patches and reconstructs missing pixels[24]
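
A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, assuming PyTorch; the random embeddings stand in for the outputs of trained image and text encoders, and the temperature value is illustrative.

```python
# Symmetric contrastive loss over a batch of image-text pairs: matching pairs
# (the diagonal of the similarity matrix) are pulled together, mismatches pushed apart.
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 512, 0.07
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in text encoder output

logits = image_emb @ text_emb.t() / temperature           # pairwise cosine similarities
targets = torch.arange(batch)                             # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +                # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2         # text -> image direction
```

SimCLR uses the same InfoNCE form, but builds its positive pairs from two augmented views of the same image rather than from image-text pairs.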

Comparative Analysis of Pre-training Paradigms

| Feature | Generative Pre-training | Contrastive Pre-training |
| --- | --- | --- |
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, autoregressive models (for example Transformers) | Siamese networks, models using InfoNCE or triplet-loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked language modeling (BERT), next-token prediction (GPT), image inpainting, denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to overfitting on limited data; sensitive to the choice of negative samples |

Datasets

Language Datasets

| Dataset | Size | Description | Used by |
| --- | --- | --- | --- |
| Common Crawl | 320+ TB (raw) | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
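
Several of these corpora are mirrored on the Hugging Face Hub and can be streamed without downloading the full archive. A minimal sketch, assuming the datasets library and the "allenai/c4" dataset identifier (illustrative; other corpora follow the same pattern):

```python
# Stream a few documents from the C4 corpus for pre-training experiments.
# Assumes `pip install datasets`; Hub identifiers and configs may change over time.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:80])  # each record holds raw web text plus metadata
    if i == 2:                   # stop after a few documents in this sketch
        break
```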

Vision Datasets

| Dataset | Size | Description | Primary use |
| --- | --- | --- | --- |
| ImageNet | 1.2M images | 1000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |

Computational Requirements

Training Costs

| Model | Parameters | Training time | Hardware | Estimated cost |
| --- | --- | --- | --- | --- |
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days | 1,024 A100s | $4.6 million |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 24,576 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days | 592 V100s | $600,000 |

Hardware Evolution

  • NVIDIA A100 (2020): 312 TFLOPS, 40/80GB memory, workhorse for 2020-2023 training[25]
  • NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[26]
  • Google TPU v5e (2023): Pods with 50,944 chips achieving 10 exaFLOPS[27]

Notable Pre-Trained Models

| Model | Domain | Release year | Parameters | Key objective | Developer |
| --- | --- | --- | --- | --- | --- |
| Word2Vec | NLP | 2013 | 300 dim | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook |
| T5 | NLP | 2019 | 11B | Text-to-text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced token detection | Google |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual language modeling | DeepMind |

Applications

Natural Language Processing

Pre-trained language models power nearly all modern NLP applications.[14][28]

Computer Vision

Pre-trained vision models serve as backbones for a wide range of downstream applications.[31]

Speech and Multimodal Systems

Benefits and Advantages

Pre-training offers several critical advantages:[36]

  • Resource efficiency: Reduces labeled data requirements by 10-100x
  • Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
  • Better performance: Pre-trained models consistently outperform random initialization
  • Transfer learning: Knowledge transfers across related tasks and domains
  • Democratization: Smaller teams can leverage frontier model capabilities
  • Generalization: Models learn robust features that work across diverse applications

Challenges and Limitations

Environmental Impact

The computational demands of pre-training create significant environmental costs:

  • Carbon emissions: an influential early estimate put the emissions of training one large transformer with neural architecture search at ~626,000 pounds of CO₂[37]
  • Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[38]
  • Energy use: ChatGPT queries use ~10x more energy than Google searches[39]

Bias, Fairness, and Ethical Considerations

Pre-trained models inherit and amplify societal biases present in training data:[40]

  • Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[41]
  • Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[41]
  • Linguistic bias: C4 filters African American English at 42% vs 6.2% for White American English[42]
  • Disability bias: Perpetuation of negative stereotypes about people with disabilities[43]

Data and Legal Issues

  • Copyright concerns: Active litigation regarding training on copyrighted content
  • Privacy violations: Models may memorize and reproduce personal information
  • Data contamination: Benchmarks appearing in training data inflate scores
  • Data quality: Web-scraped data contains misinformation, toxicity, and bias

Accessibility and Centralization

  • Geographic concentration: US produces 5x more foundation models than China[44]
  • Compute barriers: Training frontier models requires $10-100M+ in resources
  • Hardware costs: Single H100 GPU costs $30,000-40,000
  • Technical expertise: Requires specialized knowledge in distributed systems and optimization

Future Directions

Efficiency Improvements

Multimodal and Lifelong Learning

Beyond Pre-training: Peak Data and Alignment

Some researchers suggest that the era of scaling up pre-training datasets is coming to an end.[51]

See also

References

  1. Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/1206.5538
  2. NIST CSRC AI Glossary (2025). Definition of "pre-training": "training a general-purpose model on publicly-available data, often followed by fine-tuning for task-specific information." https://csrc.nist.gov/glossary/term/pre_training
  3. Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. https://arxiv.org/abs/2108.07258
  4. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). "A fast learning algorithm for deep belief nets". Neural Computation, 18(7), 1527-1554. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
  5. Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., & Bengio, S. (2010). "Why Does Unsupervised Pre-training Help Deep Learning?". Journal of Machine Learning Research. https://jmlr.org/papers/v11/erhan10a.html
  6. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
  7. Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv. https://arxiv.org/abs/1301.3781
  8. Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation". EMNLP. https://nlp.stanford.edu/pubs/glove.pdf
  9. Peters, M., et al. (2018). "Deep Contextualized Word Representations". NAACL. https://arxiv.org/abs/1802.05365
  10. Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS. https://arxiv.org/abs/1706.03762
  11. Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL. https://arxiv.org/abs/1810.04805
  12. Brown, T., et al. (2020). "Language Models are Few-Shot Learners". NeurIPS. https://arxiv.org/abs/2005.14165
  13. Liu, X., et al. (2020). "Self-supervised Learning: Generative or Contrastive". arXiv. https://arxiv.org/abs/2006.08218
  14. Lee, Angie (December 8, 2022). "What Is a Pretrained AI Model?". NVIDIA Blog. https://blogs.nvidia.com/blog/what-is-a-pretrained-ai-model/
  15. Amazon Web Services. "What Is Transfer Learning?". https://aws.amazon.com/what-is/transfer-learning/
  16. Dev.to. "Understanding the Differences: Fine-Tuning vs. Transfer Learning". https://dev.to/luxdevhq/understanding-the-differences-fine-tuning-vs-transfer-learning-370
  17. Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models". Meta AI. https://arxiv.org/abs/2307.09288
  18. Liu, Y., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv. https://arxiv.org/abs/1907.11692
  19. Joshi, M., et al. (2020). "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL. https://arxiv.org/abs/1907.10529
  20. Clark, K., et al. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR. https://arxiv.org/abs/2003.10555
  21. Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
  22. Chen, T., et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations". ICML. https://arxiv.org/abs/2002.05709
  23. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". ICML. https://arxiv.org/abs/2103.00020
  24. He, K., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners". CVPR. https://arxiv.org/abs/2111.06377
  25. NVIDIA. (2020). "NVIDIA A100 Tensor Core GPU Architecture". https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
  26. NVIDIA. (2022). "NVIDIA H100 Tensor Core GPU Architecture". https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
  27. Google Cloud. (2023). "Cloud TPU v5e". https://cloud.google.com/tpu/docs/v5e
  28. Toloka. "What is Pre-training in LLM Development?". https://toloka.ai/blog/pre-training-in-llm-development/
  29. Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD". ACL. https://arxiv.org/abs/1806.03822
  30. GitHub. (2024). "GitHub Copilot Impact Research". https://github.blog/news-insights/research/
  31. Viso.ai. "Top 45 Computer Vision Applications in 2024". https://viso.ai/applications/computer-vision-applications/
  32. Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR. https://arxiv.org/abs/2010.11929
  33. GeeksForGeeks. "What is Pre-training and its Objective?". https://www.geeksforgeeks.org/artificial-intelligence/what-is-pre-training-and-its-objective/
  34. Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR. https://arxiv.org/abs/2112.10752
  35. Alayrac, J. B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". NeurIPS. https://arxiv.org/abs/2204.14198
  36. Baeldung (2025). "What Does Pre-training a Neural Network Mean?". https://www.baeldung.com/cs/neural-network-pre-training
  37. Strubell, E., Ganesh, A., & McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP". ACL. https://arxiv.org/abs/1906.02243
  38. Li, P., et al. (2023). "Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models". arXiv. https://arxiv.org/abs/2304.03271
  39. Wikipedia. "Environmental impact of artificial intelligence". https://en.wikipedia.org/wiki/Environmental_impact_of_artificial_intelligence
  40. Google Cloud. "What are foundation models?". https://cloud.google.com/discover/what-are-foundation-models
  41. Li, Z., et al. (2024). "Explicitly unbiased large language models still form biased mental models". PNAS. https://www.pnas.org/doi/10.1073/pnas.2416228122
  42. Dodge, J., et al. (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus". EMNLP. https://arxiv.org/abs/2104.08758
  43. Venkit, P. N., et al. (2022). "A Study of Implicit Bias in Pretrained Language Models against People with Disabilities". COLING. https://aclanthology.org/2022.coling-1.113/
  44. Maslej, N., et al. (2024). "The AI Index 2024 Annual Report". Stanford HAI. https://aiindex.stanford.edu/report/
  45. Jiang, A. Q., et al. (2024). "Mixtral of Experts". arXiv. https://arxiv.org/abs/2401.04088
  46. Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS. https://arxiv.org/abs/2205.14135
  47. Lan, Z., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR. https://arxiv.org/abs/1909.11942
  48. MultimodalPretraining.github.io. "Workshop on Multimodal Pre-training". https://multimodalpretraining.github.io/
  49. Ibrahim, M., et al. (2024). "Simple and Scalable Strategies to Continually Pre-train Large Language Models". arXiv. https://arxiv.org/abs/2403.08763
  50. Mehta, S. V. (2023). "Efficient Lifelong Learning in Deep Neural Networks". Carnegie Mellon University. https://kilthub.cmu.edu/articles/thesis/24992883
  51. PDFTranslate.ai. "Ilya Sutskever: LLM Pre-training as we know it will end". https://pdftranslate.ai/blog/llm-end-of-era
  52. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". Anthropic. https://arxiv.org/abs/2212.08073
  53. Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS. https://arxiv.org/abs/2305.18290

External links