Pre-training

Pre-training is a foundational machine learning paradigm where neural network models first learn general representations from massive unlabeled datasets before being adapted to specific downstream tasks through fine-tuning. This two-stage approach has revolutionized artificial intelligence across domains, enabling models to achieve superior performance with dramatically less task-specific labeled data than training from scratch.[1]

Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before being fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3]

Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on trillions of tokens or billions of images, learning rich patterns that transfer effectively across countless downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, models pre-trained on general data can adapt to new tasks with mere thousands or even dozens of examples.

History

Early foundations (2006-2012)

Pre-training's conceptual roots trace to 2006-2007, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough made it practical to train deep networks that had previously been intractable because of vanishing gradients: each layer was pre-trained without labels, and the full network was then fine-tuned with supervision.[5]

The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet challenge with a top-5 error of 15.3%, compared to 26.2% for the runner-up, a margin of 10.9 percentage points.[6] This result established supervised large-scale pre-training as the dominant transfer-learning approach in computer vision.

Word embeddings era (2013-2018)

Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]

Transformer revolution (2017-present)

The June 2017 paper "Attention Is All You Need" introduced transformers, fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.

BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training with masked language modeling.[11] The GPT series, whose first version appeared a few months before BERT, took the opposite, unidirectional approach of autoregressive next-token prediction; GPT-3 (2020) showed that this objective combined with massive scale yields emergent capabilities such as few-shot learning.[12]

Core Concepts

The Two-Stage Paradigm: Pre-training and Fine-tuning

The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]

  1. Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
  2. Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on the entire internet might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]

This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
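
The snippet below is a minimal sketch of stage 2, assuming PyTorch and the Hugging Face transformers library; the "bert-base-uncased" checkpoint, the two example sentences, and the hyperparameters are purely illustrative. Stage 1 is assumed to have been done already by the checkpoint's publisher.

```python
# Sketch of the two-stage paradigm: load a pre-trained checkpoint (Stage 1 already done),
# then fine-tune it on a tiny labeled sentiment dataset (Stage 2).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "bert-base-uncased" was pre-trained with masked language modeling on large text corpora.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny labeled dataset for the downstream task (illustrative only).
texts = ["great product, works perfectly", "arrived broken and late"]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy data; real fine-tuning uses proper data loaders
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```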

Relationship with Transfer Learning

Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.

This transfer can be implemented in two primary ways:

  • Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
  • Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
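
The two options above can be contrasted in a short PyTorch/torchvision sketch, assuming torchvision 0.13+ for the pre-trained weights API; the class count and the choice of ResNet-50 are illustrative.

```python
# Feature extraction vs. fine-tuning with an ImageNet-pre-trained ResNet-50 backbone.
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

num_classes = 10  # illustrative downstream task

# Feature extraction: freeze every pre-trained parameter; only the new head is trained.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new layer is trainable by default

# Fine-tuning: replace the head but leave the backbone parameters trainable as well,
# typically with a smaller learning rate for the pre-trained layers.
finetune_model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
finetune_model.fc = nn.Linear(finetune_model.fc.in_features, num_classes)
```

Feature extraction is often preferred when the downstream dataset is small, while full fine-tuning tends to help when more labeled data is available.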

Technical Approach

Core Mechanics

Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:

  1. Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
  2. Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
  3. Training process:
    • Forward propagation through deep neural networks (typically transformers with 12-96 layers)
    • Calculating loss against self-supervised objectives
    • Backpropagating gradients
    • Updating billions of parameters using optimizers like Adam or AdamW

Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
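
The loop below is a schematic of these steps at toy scale, assuming PyTorch; the tiny two-layer model, the random token batches, and the hyperparameters stand in for a real corpus and a multi-billion-parameter transformer.

```python
# Schematic pre-training loop: forward pass, self-supervised (next-token) loss,
# backpropagation, and an AdamW parameter update. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch_size = 1000, 64, 32, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq = tokens.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.head(h)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):                                      # real runs stream trillions of tokens
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for a text batch
    logits = model(tokens[:, :-1])                            # forward propagation
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),    # self-supervised objective:
                           tokens[:, 1:].reshape(-1))         # predict the next token
    loss.backward()                                           # backpropagate gradients
    optimizer.step()                                          # update parameters
    optimizer.zero_grad()
```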

Pre-training Objectives

Masked Language Modeling (MLM)

Popularized by BERT, MLM randomly masks approximately 15% of input tokens and trains models to predict them using full bidirectional context.[11] The masking strategy:

  • 80% of selected tokens become [MASK]
  • 10% swap to random tokens
  • 10% remain unchanged
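
A minimal sketch of this 80/10/10 scheme, assuming PyTorch; the token IDs, vocabulary size, and the mask_tokens helper are illustrative, and details such as special-token handling and whole-word masking are omitted.

```python
# BERT-style MLM masking: ~15% of positions become prediction targets,
# split 80% [MASK] / 10% random token / 10% unchanged.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob  # choose ~15% of positions
    labels[~selected] = -100                           # ignore unselected positions in the loss

    masked = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    masked[selected & (rand < 0.8)] = mask_token_id    # 80% of selected tokens -> [MASK]
    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    swap = selected & (rand >= 0.8) & (rand < 0.9)     # 10% -> a random token
    masked[swap] = random_tokens[swap]
    return masked, labels                              # remaining 10% stay unchanged

# Usage with illustrative IDs:
ids = torch.randint(5, 1000, (2, 16))
inputs, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
```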

Variants include:

  • RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
  • SpanBERT: Masks contiguous spans rather than individual tokens[19]
  • ELECTRA: Discriminative pre-training detecting replaced tokens[20]

Autoregressive Language Modeling

The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right order.[21] The training objective maximizes the factorized likelihood P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁)
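
The snippet below illustrates this factorization numerically, assuming PyTorch: the sequence log-likelihood is the sum of per-token conditional log-probabilities, which equals the negated cross-entropy loss used in training. The random logits stand in for a causal language model's outputs.

```python
# Relate the product-of-conditionals likelihood to the cross-entropy training loss.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 8
tokens = torch.randint(0, vocab_size, (seq_len,))
logits = torch.randn(seq_len - 1, vocab_size)        # stand-in model outputs for positions 2..n

log_probs = F.log_softmax(logits, dim=-1)
# log P(x_2, ..., x_n | x_1) = sum_i log P(x_i | x_1, ..., x_{i-1})
sequence_log_prob = log_probs[torch.arange(seq_len - 1), tokens[1:]].sum()
nll = F.cross_entropy(logits, tokens[1:], reduction="sum")  # equals -sequence_log_prob
```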

Contrastive Learning

Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:

  • SimCLR: Learns representations maximizing agreement between augmented views of same image[22]
  • CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
  • MAE: Masks 75% of image patches and reconstructs missing pixels[24]
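
A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, assuming PyTorch; the random embeddings stand in for the outputs of trained image and text encoders, and the temperature value is illustrative.

```python
# Symmetric contrastive loss over a batch of image-text pairs: matching pairs
# (the diagonal of the similarity matrix) are pulled together, mismatches pushed apart.
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 512, 0.07
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in text encoder output

logits = image_emb @ text_emb.t() / temperature           # pairwise cosine similarities
targets = torch.arange(batch)                             # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +                # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2         # text -> image direction
```

SimCLR uses the same InfoNCE form, but builds its positive pairs from two augmented views of the same image rather than from image-text pairs.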

Comparative Analysis of Pre-training Paradigms

| Feature | Generative Pre-training | Contrastive Pre-training |
| --- | --- | --- |
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, autoregressive models (for example Transformers) | Siamese networks, models using InfoNCE or triplet-loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked language modeling (BERT), next-token prediction (GPT), image inpainting, denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to overfitting on limited data; sensitive to the choice of negative samples |

Datasets

Language Datasets

| Dataset | Size | Description | Used by |
| --- | --- | --- | --- |
| Common Crawl | 320+ TB (raw) | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
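
Several of these corpora are mirrored on the Hugging Face Hub and can be streamed without downloading the full archive. A minimal sketch, assuming the datasets library and the "allenai/c4" dataset identifier (illustrative; other corpora follow the same pattern):

```python
# Stream a few documents from the C4 corpus for pre-training experiments.
# Assumes `pip install datasets`; Hub identifiers and configs may change over time.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:80])  # each record holds raw web text plus metadata
    if i == 2:                   # stop after a few documents in this sketch
        break
```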

Vision Datasets

| Dataset | Size | Description | Primary use |
| --- | --- | --- | --- |
| ImageNet | 1.2M images | 1000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |

Computational Requirements

Training Costs

| Model | Parameters | Training time | Hardware | Estimated cost |
| --- | --- | --- | --- | --- |
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days | 1,024 A100s | $4.6 million |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 24,576 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days | 592 V100s | $600,000 |

Hardware Evolution

  • NVIDIA A100 (2020): 312 TFLOPS, 40/80GB memory, workhorse for 2020-2023 training[25]
  • NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[26]
  • Google TPU v5e (2023): Pods with 50,944 chips achieving 10 exaFLOPS[27]

Notable Pre-Trained Models

| Model | Domain | Release year | Parameters | Key objective | Developer |
| --- | --- | --- | --- | --- | --- |
| Word2Vec | NLP | 2013 | 300 dim | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook |
| T5 | NLP | 2019 | 11B | Text-to-text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced token detection | Google |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual language modeling | DeepMind |

Applications

Natural Language Processing

Pre-trained language models power nearly all modern NLP applications.[14][28]

Computer Vision

Pre-trained vision models serve as backbones for a wide range of downstream applications.[31]

Speech and Multimodal Systems

Benefits and Advantages

Pre-training offers several critical advantages:[36]

  • Resource efficiency: Reduces labeled data requirements by 10-100x
  • Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
  • Better performance: Pre-trained models consistently outperform random initialization
  • Transfer learning: Knowledge transfers across related tasks and domains
  • Democratization: Smaller teams can leverage frontier model capabilities
  • Generalization: Models learn robust features that work across diverse applications

Challenges and Limitations

Environmental Impact

The computational demands of pre-training create significant environmental costs:

  • Carbon emissions: an influential early estimate put the emissions of training one large transformer with neural architecture search at ~626,000 pounds of CO₂[37]
  • Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[38]
  • Energy use: ChatGPT queries use ~10x more energy than Google searches[39]

Bias, Fairness, and Ethical Considerations

Pre-trained models inherit and amplify societal biases present in training data:[40]

  • Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[41]
  • Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[41]
  • Linguistic bias: C4 filters African American English at 42% vs 6.2% for White American English[42]
  • Disability bias: Perpetuation of negative stereotypes about people with disabilities[43]

Data and Legal Issues

  • Copyright concerns: Active litigation regarding training on copyrighted content
  • Privacy violations: Models may memorize and reproduce personal information
  • Data contamination: Benchmarks appearing in training data inflate scores
  • Data quality: Web-scraped data contains misinformation, toxicity, and bias

Accessibility and Centralization

  • Geographic concentration: US produces 5x more foundation models than China[44]
  • Compute barriers: Training frontier models requires $10-100M+ in resources
  • Hardware costs: Single H100 GPU costs $30,000-40,000
  • Technical expertise: Requires specialized knowledge in distributed systems and optimization

Future Directions

Efficiency Improvements

Multimodal and Lifelong Learning

Beyond Pre-training: Peak Data and Alignment

Some researchers suggest that the era of scaling up pre-training datasets is coming to an end.[51]

See also

References

  1. Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/1206.5538
  2. NIST CSRC AI Glossary (2025). Definition of "pre-training": "training a general-purpose model on publicly-available data, often followed by fine-tuning for task-specific information." https://csrc.nist.gov/glossary/term/pre_training
  3. Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. https://arxiv.org/abs/2108.07258
  4. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). "A fast learning algorithm for deep belief nets". Neural Computation, 18(7), 1527-1554. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
  5. Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., & Bengio, S. (2010). "Why Does Unsupervised Pre-training Help Deep Learning?". Journal of Machine Learning Research. https://jmlr.org/papers/v11/erhan10a.html
  6. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
  7. Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv. https://arxiv.org/abs/1301.3781
  8. Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation". EMNLP. https://nlp.stanford.edu/pubs/glove.pdf
  9. Peters, M., et al. (2018). "Deep Contextualized Word Representations". NAACL. https://arxiv.org/abs/1802.05365
  10. Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS. https://arxiv.org/abs/1706.03762
  11. Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL. https://arxiv.org/abs/1810.04805
  12. Brown, T., et al. (2020). "Language Models are Few-Shot Learners". NeurIPS. https://arxiv.org/abs/2005.14165
  13. Liu, X., et al. (2020). "Self-supervised Learning: Generative or Contrastive". arXiv. https://arxiv.org/abs/2006.08218
  14. Lee, Angie (December 8, 2022). "What Is a Pretrained AI Model?". NVIDIA Blog. https://blogs.nvidia.com/blog/what-is-a-pretrained-ai-model/
  15. Amazon Web Services. "What Is Transfer Learning?". https://aws.amazon.com/what-is/transfer-learning/
  16. Dev.to. "Understanding the Differences: Fine-Tuning vs. Transfer Learning". https://dev.to/luxdevhq/understanding-the-differences-fine-tuning-vs-transfer-learning-370
  17. Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models". Meta AI. https://arxiv.org/abs/2307.09288
  18. Liu, Y., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv. https://arxiv.org/abs/1907.11692
  19. Joshi, M., et al. (2020). "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL. https://arxiv.org/abs/1907.10529
  20. Clark, K., et al. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR. https://arxiv.org/abs/2003.10555
  21. Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
  22. Chen, T., et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations". ICML. https://arxiv.org/abs/2002.05709
  23. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". ICML. https://arxiv.org/abs/2103.00020
  24. He, K., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners". CVPR. https://arxiv.org/abs/2111.06377
  25. NVIDIA. (2020). "NVIDIA A100 Tensor Core GPU Architecture". https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
  26. NVIDIA. (2022). "NVIDIA H100 Tensor Core GPU Architecture". https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
  27. Google Cloud. (2023). "Cloud TPU v5e". https://cloud.google.com/tpu/docs/v5e
  28. Toloka. "What is Pre-training in LLM Development?". https://toloka.ai/blog/pre-training-in-llm-development/
  29. Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD". ACL. https://arxiv.org/abs/1806.03822
  30. GitHub. (2024). "GitHub Copilot Impact Research". https://github.blog/news-insights/research/
  31. Viso.ai. "Top 45 Computer Vision Applications in 2024". https://viso.ai/applications/computer-vision-applications/
  32. Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR. https://arxiv.org/abs/2010.11929
  33. GeeksForGeeks. "What is Pre-training and its Objective?". https://www.geeksforgeeks.org/artificial-intelligence/what-is-pre-training-and-its-objective/
  34. Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR. https://arxiv.org/abs/2112.10752
  35. Alayrac, J. B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". NeurIPS. https://arxiv.org/abs/2204.14198
  36. Baeldung (2025). "What Does Pre-training a Neural Network Mean?". https://www.baeldung.com/cs/neural-network-pre-training
  37. Strubell, E., Ganesh, A., & McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP". ACL. https://arxiv.org/abs/1906.02243
  38. Li, P., et al. (2023). "Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models". arXiv. https://arxiv.org/abs/2304.03271
  39. Wikipedia. "Environmental impact of artificial intelligence". https://en.wikipedia.org/wiki/Environmental_impact_of_artificial_intelligence
  40. Google Cloud. "What are foundation models?". https://cloud.google.com/discover/what-are-foundation-models
  41. Li, Z., et al. (2024). "Explicitly unbiased large language models still form biased mental models". PNAS. https://www.pnas.org/doi/10.1073/pnas.2416228122
  42. Dodge, J., et al. (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus". EMNLP. https://arxiv.org/abs/2104.08758
  43. Venkit, P. N., et al. (2022). "A Study of Implicit Bias in Pretrained Language Models against People with Disabilities". COLING. https://aclanthology.org/2022.coling-1.113/
  44. Maslej, N., et al. (2024). "The AI Index 2024 Annual Report". Stanford HAI. https://aiindex.stanford.edu/report/
  45. Jiang, A. Q., et al. (2024). "Mixtral of Experts". arXiv. https://arxiv.org/abs/2401.04088
  46. Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS. https://arxiv.org/abs/2205.14135
  47. Lan, Z., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR. https://arxiv.org/abs/1909.11942
  48. MultimodalPretraining.github.io. "Workshop on Multimodal Pre-training". https://multimodalpretraining.github.io/
  49. Ibrahim, M., et al. (2024). "Simple and Scalable Strategies to Continually Pre-train Large Language Models". arXiv. https://arxiv.org/abs/2403.08763
  50. Mehta, S. V. (2023). "Efficient Lifelong Learning in Deep Neural Networks". Carnegie Mellon University. https://kilthub.cmu.edu/articles/thesis/24992883
  51. PDFTranslate.ai. "Ilya Sutskever: LLM Pre-training as we know it will end". https://pdftranslate.ai/blog/llm-end-of-era
  52. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". Anthropic. https://arxiv.org/abs/2212.08073
  53. Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS. https://arxiv.org/abs/2305.18290

External links