# Pre-training

> Source: https://aiwiki.ai/wiki/pre-training
> Updated: 2026-06-20
> Categories: Artificial Intelligence, Computer Vision, Deep Learning, Machine Learning, Natural Language Processing, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Pre-training** is the first and most compute-intensive stage of building a modern AI model: a [neural network](/wiki/neural_network) is trained on a massive, mostly unlabeled dataset using [self-supervised learning](/wiki/self-supervised_learning) to learn general-purpose representations, before it is later adapted to specific tasks through [fine-tuning](/wiki/fine_tuning).[1] The pre-trained result is a versatile [foundation model](/wiki/foundation_model), which the team that coined the term defined as "any model that is trained on broad data that can be adapted to a wide range of downstream tasks."[2] This two-stage approach has reshaped [artificial intelligence](/wiki/artificial_intelligence) across language, vision, and multimodal domains, letting models reach high accuracy with far less task-specific labeled data than training from scratch requires.[1]

Modern [foundation models](/wiki/foundation_model) like [GPT-4](/wiki/gpt-4), [BERT](/wiki/bert), and [CLIP](/wiki/clip) derive their capabilities primarily from pre-training on billions to trillions of tokens or hundreds of millions of images, learning patterns that transfer across many downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, a model pre-trained on general data can adapt to new tasks with thousands or even dozens of examples.

## What is pre-training in machine learning?

Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before it is fine-tuned on a specific problem.[1] In this stage, a model (often called a [foundation model](/wiki/foundation_model) when it is general-purpose) is trained on large-scale data, frequently using [unlabeled data](/wiki/unlabeled_data) and [self-supervised learning](/wiki/self-supervised_learning) objectives, to acquire a broad understanding of features or knowledge.[3] The objective is typically self-supervised, such as predicting the next word in a sentence or filling in masked parts of an image, so no human labels are needed and the data itself supplies the supervision signal.
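To make "the data itself supplies the supervision signal" concrete, the following minimal sketch turns raw text into training pairs for next-word prediction. The function name `next_token_pairs`, the whitespace tokenization, and the `context_size` parameter are illustrative assumptions, not part of any particular library.

```python
from typing import List, Tuple


def next_token_pairs(tokens: List[str], context_size: int = 3) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs from unlabeled tokens.

    The target is simply the token that follows the context, so no human
    annotation is required.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting stands in for a real tokenizer (an assumption).
pairs = next_token_pairs("the cat sat on the mat".split())
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why unlabeled corpora can supply far more supervision than hand-labeled datasets.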

## History

### Early foundations (2006-2012)

Pre-training's conceptual roots trace to 2006, when [Geoffrey Hinton](/wiki/geoffrey_hinton), Simon Osindero, and Yee-Whye Teh published "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation, introducing greedy, layer-wise unsupervised pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough enabled training deep networks that were previously intractable due to [vanishing gradient problems](/wiki/vanishing_gradient_problem): each layer was pre-trained unsupervised, then the full network was fine-tuned with supervision.[5]

The paradigm shifted dramatically in 2012 when [AlexNet](/wiki/alexnet) won the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge with a top-5 error of 15.3%, compared to 26.2% for the second-place entry, a margin of 10.9 percentage points.[6] This established supervised large-scale pre-training as the dominant [transfer learning](/wiki/transfer_learning) approach for [computer vision](/wiki/computer_vision).

### Word embeddings era (2013-2018)

Word embeddings emerged as natural language processing's first scalable pre-training method. [Word2Vec](/wiki/word2vec) (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional [LSTMs](/wiki/lstm) to generate different embeddings for the same word based on context.[9]

### Transformer revolution (2017-present)

The paper "[Attention Is All You Need](/wiki/attention_is_all_you_need)," posted to arXiv on June 12, 2017, introduced [transformers](/wiki/transformer), fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.

[BERT](/wiki/bert)'s October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling. As Devlin et al. wrote, "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers."[11] OpenAI's original [GPT](/wiki/gpt), released a few months earlier in June 2018, took the complementary unidirectional, autoregressive approach. With [GPT-3](/wiki/gpt-3) (2020), OpenAI introduced a 175-billion-parameter model, which the paper described as "10x more than any previous non-sparse language model," demonstrating that scale plus autoregressive pre-training yields emergent capabilities like [few-shot learning](/wiki/few-shot_learning).[12]

## Core Concepts

### The Two-Stage Paradigm: Pre-training and Fine-tuning

The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]

1. **Stage 1: Pre-training:** In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a **pre-trained model**, which serves as a versatile foundation.[5]

2. **Stage 2: [Fine-tuning](/wiki/fine_tuning):** The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on a large web corpus might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]

This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable [foundation models](/wiki/foundation_model).

### How does pre-training relate to transfer learning?

Pre-training is the core mechanism that enables **[transfer learning](/wiki/transfer_learning)**.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.

This transfer can be implemented in two primary ways:

- **Feature Extraction:** The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]

- **Fine-tuning:** Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
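The difference between the two modes comes down to which parameters receive gradient updates. The sketch below uses a deliberately tiny stand-in model (one scalar "backbone" weight and one scalar "head" weight) rather than a real network; the `train` function, the toy data, and the starting weights are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def train(w_backbone: float, w_head: float, data: List[Tuple[float, float]],
          freeze_backbone: bool, lr: float = 0.01, epochs: int = 200) -> Tuple[float, float]:
    """Fit pred = w_head * (w_backbone * x) to the data with squared-error loss."""
    for _ in range(epochs):
        for x, y in data:
            feature = w_backbone * x            # "pre-trained" feature extractor
            err = w_head * feature - y          # prediction error of the task head
            grad_head = 2 * err * feature
            grad_backbone = 2 * err * w_head * x
            w_head -= lr * grad_head            # the head is always trained
            if not freeze_backbone:             # fine-tuning also updates the backbone
                w_backbone -= lr * grad_backbone
    return w_backbone, w_head


data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]  # toy target: y = 3x

# Feature extraction: the backbone weight stays at its pre-trained value.
frozen_backbone, frozen_head = train(2.0, 0.1, data, freeze_backbone=True)
# Fine-tuning: both weights move.
tuned_backbone, tuned_head = train(2.0, 0.1, data, freeze_backbone=False)
print(frozen_backbone, frozen_head)
print(tuned_backbone, tuned_head)
```

In deep learning frameworks the same choice is usually expressed by disabling gradients on the backbone's parameters, which is cheaper per step but leaves less room to adapt.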

## Technical Approach

### Core Mechanics

Pre-training operates through [self-supervised learning](/wiki/self-supervised_learning) objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:

1. **Data collection**: Gathering datasets from sources like [Common Crawl](/wiki/common_crawl) (15+ trillion tokens) or ImageNet (14+ million images)

2. **Pretext task design**: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs

3. **Training process**:
   - Forward propagation through deep neural networks (typically transformers with 12-96 layers)

   - Calculating loss against self-supervised objectives

   - [Backpropagating](/wiki/backpropagation) gradients

   - Updating billions of parameters using optimizers like Adam or AdamW
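The training-process steps above can be sketched at toy scale. This example is an assumption-heavy simplification: it replaces the transformer with a bigram table of next-token logits and replaces Adam/AdamW with plain stochastic gradient descent, so it shows only the shape of the loop (forward pass, loss, gradient, update). The name `train_bigram` is hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens: List[str], epochs: int = 200,
                 lr: float = 0.5) -> Tuple[List[str], List[List[float]], List[float]]:
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index[t] for t in tokens]
    # Parameters: one row of next-token logits per current token.
    logits = [[0.0] * len(vocab) for _ in vocab]
    losses = []
    for _ in range(epochs):
        total_loss = 0.0
        for cur, nxt in zip(ids, ids[1:]):
            probs = softmax(logits[cur])              # forward pass
            total_loss += -math.log(probs[nxt])       # cross-entropy loss
            for j in range(len(vocab)):
                # Gradient of the loss w.r.t. each logit is (prob - one_hot).
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                logits[cur][j] -= lr * grad           # SGD parameter update
        losses.append(total_loss / (len(ids) - 1))
    return vocab, logits, losses


corpus = "the cat sat on the mat the cat sat on the rug".split()
vocab, logits, losses = train_bigram(corpus)
print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

Real pre-training differs in scale and machinery (batching, automatic differentiation, distributed optimizers), but the loop has the same four stages.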

Modern pre-training runs for weeks to months on thousands of GPUs or [TPUs](/wiki/tpu), processing datasets measured in terabytes.[17]

### Pre-training Objectives

#### Masked Language Modeling (MLM)

Popularized by BERT, MLM randomly masks approximately 15% of input tokens and trains models to predict them using full bidirectional context.[11] The masking strategy:

- 80% of selected tokens become [MASK]

- 10% swap to random tokens

- 10% remain unchanged
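A minimal sketch of this 80/10/10 corruption rule follows. It is a simplification: each token is selected independently with probability 0.15, whereas practical implementations typically also skip special tokens and may cap the number of predictions per sequence. The function name `mask_tokens` and its signature are illustrative, not the API of any library.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                select_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels); labels are None where no prediction is made."""
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                         # token not selected for prediction
        labels[i] = tok                      # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                 # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


sentence = "pre-training learns general representations from unlabeled text".split()
inputs, labels = mask_tokens(sentence, vocab=sentence, rng=random.Random(0))
print(inputs)
print(labels)
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on `[MASK]` appearing at every position it must predict, which matters because `[MASK]` never appears at fine-tuning time.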

Variants include:

- **[RoBERTa](/wiki/roberta)**: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]

- **SpanBERT**: Masks contiguous spans rather than individual tokens[19]

- **[ELECTRA](/wiki/electra)**: Discriminative pre-training detecting replaced tokens[20]

#### Autoregressive Language Modeling

The [GPT](/wiki/gpt) family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] Training maximizes the likelihood of each sequence, which factorizes by the chain rule of probability:

P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁)
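The factorization above can be evaluated directly by summing per-token log-probabilities. In this sketch the conditional distribution is a hypothetical lookup table (`TOY`) that depends only on the previous token; a real language model would condition on the full prefix.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical conditional probabilities P(next | previous); None marks sequence start.
TOY: Dict[Optional[str], Dict[str, float]] = {
    None: {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}


def toy_cond_prob(prefix: Tuple[str, ...], token: str) -> float:
    prev = prefix[-1] if prefix else None
    return TOY[prev].get(token, 0.0)


def sequence_log_prob(tokens: List[str],
                      cond_prob: Callable[[Tuple[str, ...], str], float]) -> float:
    """log P(x_1..x_n) = sum_i log P(x_i | x_1..x_{i-1})."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tuple(tokens[:i]), tok))
    return total


log_p = sequence_log_prob(["the", "cat", "sat"], toy_cond_prob)
print(math.exp(log_p))  # product of 1.0 * 0.5 * 1.0
```

Pre-training minimizes the negative of this quantity, averaged over tokens, which is the familiar cross-entropy loss.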

#### Contrastive Learning

[Contrastive learning](/wiki/contrastive_learning) revolutionized self-supervised pre-training in computer vision and multimodal domains:

- **SimCLR**: Learns representations maximizing agreement between augmented views of same image[22]

- **CLIP**: Jointly trains image and text encoders on 400 million image-text pairs[23]

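- **MAE**: Masks 75% of image patches and reconstructs the missing pixels; this is a generative (masked reconstruction) objective rather than a contrastive one, listed here as the main self-supervised alternative in vision[24]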
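The contrastive methods in this list share a common core: pull matched pairs together and push mismatched pairs apart in embedding space. The sketch below implements a simplified, one-directional InfoNCE loss over a small batch. It is not the exact loss of SimCLR (NT-Xent) or CLIP (which is symmetric over both modalities), and the function name `info_nce` and the temperature value are assumptions.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors: List[List[float]], positives: List[List[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss; positives[i] matches anchors[i], other rows act as negatives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]   # scaled cosine similarities
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denom - logits[i]                      # -log softmax at the true pair
    return loss / len(a)


views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(views_a, [[1.0, 0.0], [0.0, 1.0]])     # each anchor matches its positive
mismatched = info_nce(views_a, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(aligned, mismatched)
```

The loss is small when each anchor is most similar to its own positive and large when another item in the batch is closer, which is why batch size and negative selection matter for these methods.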

| Feature | Generative Pre-training | Contrastive Pre-training |
| --- | --- | --- |
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may have inferior data scaling capacity | Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice |

## Datasets

### Language Datasets

| Dataset | Size | Description | Used by |
| --- | --- | --- | --- |
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | [BERT](/wiki/bert), [GPT](/wiki/gpt) |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |

### Vision Datasets

| Dataset | Size | Description | Primary use |
| --- | --- | --- | --- |
| ImageNet (ILSVRC-1k) | 1.2M images | 1000 object classes; subset of the full 14M-image ImageNet | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |

## How much does pre-training cost?

Pre-training costs span several orders of magnitude, from a few hundred dollars for a small encoder to tens of millions of dollars for frontier models, whose training runs are among the most expensive computations in industry. Lambda Labs estimated that pre-training GPT-3 required roughly 3.14 x 10^23 floating-point operations, which at $1.5 per GPU-hour on a V100 server would cost about $4.6 million.[12] Meta reported that pre-training [Llama 3.1](/wiki/llama_3_1) 405B on more than 15 trillion tokens took 30.84 million GPU-hours on a cluster of 16,384 NVIDIA H100 80GB GPUs.[25]
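The arithmetic behind such estimates is simple enough to reproduce. The sketch below assumes a sustained throughput of 28 TFLOPS per V100, which is an assumption chosen to be consistent with the cited cost figure rather than a measured value, and it assumes every GPU in the Llama 3.1 cluster was busy for the whole run with no downtime. Function and variable names are illustrative.

```python
def gpu_hours_from_flops(total_flops: float, flops_per_second: float) -> float:
    """GPU-hours needed to execute total_flops at a given per-GPU throughput."""
    return total_flops / flops_per_second / 3600


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Idealized elapsed time if all GPUs run in parallel for the whole job."""
    return gpu_hours / num_gpus / 24


GPT3_FLOPS = 3.14e23             # from the estimate quoted above
V100_FLOPS_PER_SECOND = 28e12    # assumption: 28 TFLOPS sustained per V100
PRICE_PER_GPU_HOUR = 1.5         # USD, from the estimate quoted above

gpt3_gpu_hours = gpu_hours_from_flops(GPT3_FLOPS, V100_FLOPS_PER_SECOND)
gpt3_cost = gpt3_gpu_hours * PRICE_PER_GPU_HOUR
llama_days = wall_clock_days(30.84e6, 16384)

print(f"GPT-3: about {gpt3_gpu_hours / 1e6:.2f} million GPU-hours, ${gpt3_cost / 1e6:.2f} million")
print(f"Llama 3.1 405B: about {llama_days:.0f} days of idealized wall-clock time")
```

Under these assumptions the GPT-3 figure lands near the quoted $4.6 million, and the Llama 3.1 run works out to roughly 78 days; real runs take longer than the idealized number because of restarts, evaluation, and imperfect utilization.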

### Training Costs

| Model | Parameters | Training time | Hardware | Estimated cost |
| --- | --- | --- | --- | --- |
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| [GPT-3](/wiki/gpt-3) | 175B | ~34 days (estimate) | 1024 A100s (hypothetical; the original run used V100s) | $4.6 million (V100-based estimate) |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 16,384 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days (ViT-L/14); 18 days (RN50x64) | 256 V100s (ViT-L/14); 592 V100s (RN50x64) | $600,000 |

### Hardware Evolution

- **NVIDIA A100** (2020): 312 TFLOPS peak dense FP16/BF16 Tensor Core throughput, 40/80GB memory, workhorse for 2020-2023 training[26]

- **NVIDIA H100** (2022): 2-3x faster than A100, becoming standard for frontier models[27]

- **Google TPU v5e** (2023): 256-chip pods; Google reported a multislice training run spanning 50,944 chips with about 10 exaFLOPS of aggregate peak 16-bit compute[28]

## Notable Pre-Trained Models

| Model | Domain | Release Year | Parameters | Key Objective | Developer |
| --- | --- | --- | --- | --- | --- |
| Word2Vec | NLP | 2013 | 300-dim vectors (embedding size) | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| [BERT](/wiki/bert) | NLP | 2018 | 340M | MLM, NSP | Google |
| [GPT-3](/wiki/gpt-3) | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Facebook |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| [DALL-E](/wiki/dall-e) | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Google |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |

## Applications

### Natural Language Processing

Pre-trained language models power nearly all modern NLP applications:[14][29]

- **[Question answering](/wiki/question_answering)**: Models have reached 89.91% F1 on [SQuAD](/wiki/squad) 2.0, on par with the leaderboard's human baseline of about 89.5% F1[30]

- **Code generation**: Microsoft CEO Satya Nadella stated in March 2023 that [GitHub Copilot](/wiki/github_copilot) was writing 46% of the code in files where it was enabled[31]

- **Conversational AI**: Powering chatbots and virtual assistants like [ChatGPT](/wiki/chatgpt)

- **[Machine translation](/wiki/machine_translation)**: Near-human quality on many language pairs

- **[Sentiment analysis](/wiki/sentiment_analysis)**: 90%+ accuracy for review and social media monitoring

- **[Text summarization](/wiki/text_summarization)**: Condensing documents while preserving key information

### Computer Vision

Pre-trained vision models serve as backbones for diverse applications:[32]

- **Image classification**: Vision Transformers achieve 88.5%+ ImageNet accuracy[33]

- **[Object detection](/wiki/object_detection)**: 50+ box AP on COCO

- **Medical imaging**: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs

- **Autonomous vehicles**: Real-time object detection for pedestrians, vehicles, traffic signs

- **Industrial automation**: Quality control, safety monitoring, defect detection

### Speech and Multimodal Systems

- **Speech Recognition**: Pre-training builds robust speech-to-text systems less sensitive to accents and noise[34]

- **Text-to-image generation**: [Stable Diffusion](/wiki/stable_diffusion) uses CLIP embeddings for image synthesis[35]

- **Visual question answering**: Flamingo achieves state-of-the-art on 16 benchmarks[36]

- **Zero-shot classification**: CLIP's zero-shot classifier matches the ImageNet accuracy of a supervised ResNet-50 without training on any ImageNet labels[23]

## Benefits and Advantages

Pre-training offers several critical advantages:[37]

- **Resource efficiency**: Reduces labeled data requirements by 10-100x

- **Faster development**: Fine-tuning takes hours/days vs. weeks/months from scratch

- **Better performance**: Pre-trained models consistently outperform random initialization

- **Transfer learning**: Knowledge transfers across related tasks and domains

- **Democratization**: Smaller teams can leverage frontier model capabilities

- **Generalization**: Models learn robust features that work across diverse applications

## Challenges and Limitations

### Environmental Impact

The computational demands of pre-training create significant environmental costs:

- **Carbon emissions**: GPT-3 training produced an estimated 552 metric tons of CO2 (about 1.2 million pounds), per a 2021 Google and UC Berkeley analysis[38]

- **Water consumption**: Estimated 700,000 liters for cooling during GPT-3 training[39]

- **Energy use**: Costs continue after pre-training, at inference time: ChatGPT queries have been estimated to use roughly 10x more energy than a Google search[40]

### Bias, Fairness, and Ethical Considerations

Pre-trained models inherit and amplify societal biases present in training data:[41]

- **Gender bias**: Models associate professions with specific genders (nurse→women, engineer→men)[42]

- **Racial and ethnic bias**: Preference for stereotypically Caucasian names in leadership recommendations[42]

- **Linguistic bias**: C4's blocklist filtering removes 42% of text in African American English, compared with 6.2% of text in White American English[43]

- **Disability bias**: Perpetuation of negative stereotypes about people with disabilities[44]

### Data and Legal Issues

- **Copyright concerns**: Active litigation regarding training on copyrighted content

- **Privacy violations**: Models may memorize and reproduce personal information

- **Data contamination**: [Benchmarks](/wiki/benchmarks) appearing in training data inflate scores

- **Data quality**: Web-scraped data contains misinformation, toxicity, and bias

### Accessibility and Centralization

- **Geographic concentration**: The 2024 Stanford AI Index reported that U.S.-based institutions produced 61 notable AI models in 2023, far more than China's 15[45]

- **Compute barriers**: Training frontier models requires $10-100M+ in resources

- **Hardware costs**: A single H100 GPU has carried list and street prices in the $25,000-40,000 range

- **Technical expertise**: Requires specialized knowledge in distributed systems and optimization

## Future Directions

### Efficiency Improvements

- **[Mixture of Experts](/wiki/mixture_of_experts)**: [Mixtral](/wiki/mixtral) 8x7B matches or outperforms Llama 2 70B on most benchmarks while activating about 13B of its 47B parameters per token[46]

- **[Knowledge distillation](/wiki/knowledge_distillation)**: Creating smaller models matching larger model performance

- **[Quantization](/wiki/quantization)**: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss

- **Flash [Attention](/wiki/attention)**: 2-4x speedup through optimized attention computation[47]

- **[ALBERT](/wiki/albert)**: Parameter sharing reduces model size by 18x[48]
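As an illustration of the quantization item in this list, the sketch below applies symmetric per-tensor INT8 quantization to a short list of weights. This is one simple scheme among many (production systems often use per-channel scales, calibration data, or quantization-aware training), and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 if all weights are zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of two (FP16), and the rounding error per weight is bounded by half the scale, which is why accuracy loss is often small when weight magnitudes are well behaved.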

### Multimodal and Lifelong Learning

- **Multimodal pre-training**: Models processing text, images, audio, and video seamlessly[49]

- **Continual pre-training**: Updating models with new data without full retraining[50]

- **Domain adaptation**: Specializing models for medicine, law, science

- **Lifelong learning**: Overcoming catastrophic forgetting to learn continuously[51]

### Beyond Pre-training: Peak Data and Alignment

Some researchers suggest the era of scaling pre-training datasets is ending:[52]

- **Peak data hypothesis**: Exhausting high-quality public data sources

- **[Post-training](/wiki/post-training) focus**: Greater emphasis on alignment techniques like [RLHF](/wiki/rlhf)

- **[Constitutional AI](/wiki/constitutional_ai)**: Self-improvement guided by explicit principles[53]

- **[Direct Preference Optimization](/wiki/dpo)**: More efficient alternatives to RLHF[54]

- **Agentic AI**: Systems learning from environment interaction rather than static datasets

## See also

- [Transfer learning](/wiki/transfer_learning)

- [Foundation models](/wiki/foundation_model)

- Fine-tuning

- [Self-supervised learning](/wiki/self-supervised_learning)

- [Transformer](/wiki/transformer) (machine learning model)

- [BERT](/wiki/bert)

- [GPT-3](/wiki/gpt-3)

- [Large language model](/wiki/large_language_model)

- Zero-shot learning

- [Few-shot learning](/wiki/few-shot_learning)

- Masked language modeling

- Contrastive learning

- Word embeddings

## References

1. IBM, "What is pre-training?" IBM Think Topics. https://www.ibm.com/think/topics/pretraining
2. Rishi Bommasani et al., "On the Opportunities and Risks of Foundation Models," Center for Research on Foundation Models (CRFM), Stanford University, 2021. https://arxiv.org/abs/2108.07258
3. Stanford HAI, "Reflections on Foundation Models," Stanford Institute for Human-Centered AI. https://hai.stanford.edu/news/reflections-foundation-models
4. Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, July 2006. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
5. Google Cloud, "What is a foundation model?" https://cloud.google.com/discover/what-is-a-foundation-model
6. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25 (NeurIPS 2012). https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
7. Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
8. Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "GloVe: Global Vectors for Word Representation," EMNLP 2014. https://nlp.stanford.edu/projects/glove/
9. Matthew E. Peters et al., "Deep Contextualized Word Representations" (ELMo), NAACL 2018. https://arxiv.org/abs/1802.05365
10. Ashish Vaswani et al., "Attention Is All You Need," arXiv:1706.03762, June 12, 2017. https://arxiv.org/abs/1706.03762
11. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, October 2018. https://arxiv.org/abs/1810.04805
12. Tom B. Brown et al., "Language Models are Few-Shot Learners" (GPT-3), arXiv:2005.14165, 2020. https://arxiv.org/abs/2005.14165 ; cost estimate from Lambda Labs, "OpenAI's GPT-3 Language Model: A Technical Overview." https://lambda.ai/blog/demystifying-gpt-3
13. Google Cloud, "What is generative AI?" https://cloud.google.com/use-cases/generative-ai
14. Sebastian Ruder, "Transfer Learning - Machine Learning's Next Frontier," 2017. https://www.ruder.io/transfer-learning/
15. IBM, "What is transfer learning?" https://www.ibm.com/think/topics/transfer-learning
16. PyTorch, "Transfer Learning for Computer Vision Tutorial." https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
17. Jared Kaplan et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020. https://arxiv.org/abs/2001.08361
18. Yinhan Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv:1907.11692, 2019. https://arxiv.org/abs/1907.11692
19. Mandar Joshi et al., "SpanBERT: Improving Pre-training by Representing and Predicting Spans," arXiv:1907.10529, 2019. https://arxiv.org/abs/1907.10529
20. Kevin Clark et al., "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators," ICLR 2020. https://arxiv.org/abs/2003.10555
21. Alec Radford et al., "Improving Language Understanding by Generative Pre-Training" (GPT), OpenAI, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
22. Ting Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR), arXiv:2002.05709, 2020. https://arxiv.org/abs/2002.05709
23. Alec Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP), arXiv:2103.00020, 2021. https://arxiv.org/abs/2103.00020
24. Kaiming He et al., "Masked Autoencoders Are Scalable Vision Learners" (MAE), arXiv:2111.06377, 2021. https://arxiv.org/abs/2111.06377
25. Meta AI, "Introducing Llama 3.1: Our most capable models to date," July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
26. NVIDIA, "NVIDIA A100 Tensor Core GPU" datasheet. https://www.nvidia.com/en-us/data-center/a100/
27. NVIDIA, "NVIDIA H100 Tensor Core GPU" datasheet. https://www.nvidia.com/en-us/data-center/h100/
28. Google Cloud, "Cloud TPU v5e" documentation. https://cloud.google.com/tpu/docs/v5e
29. Pengfei Liu et al., "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing," arXiv:2107.13586, 2021. https://arxiv.org/abs/2107.13586
30. Pranav Rajpurkar, Robin Jia, and Percy Liang, "Know What You Don't Know: Unanswerable Questions for SQuAD" (SQuAD 2.0) and leaderboard. https://rajpurkar.github.io/SQuAD-explorer/
31. Satya Nadella, GitHub Copilot X announcement remarks, March 2023, as reported by GitHub/Microsoft. https://github.blog/news-insights/product-news/github-copilot-x-the-ai-powered-developer-experience/
32. Kaiming He et al., "Deep Residual Learning for Image Recognition" (ResNet), CVPR 2016. https://arxiv.org/abs/1512.03385
33. Alexey Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT), ICLR 2021. https://arxiv.org/abs/2010.11929
34. Alexei Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv:2006.11477, 2020. https://arxiv.org/abs/2006.11477
35. Robin Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion), CVPR 2022. https://arxiv.org/abs/2112.10752
36. Jean-Baptiste Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning," arXiv:2204.14198, 2022. https://arxiv.org/abs/2204.14198
37. IBM, "What is fine-tuning?" https://www.ibm.com/think/topics/fine-tuning
38. David Patterson et al., "Carbon Emissions and Large Neural Network Training," arXiv:2104.10350, 2021. https://arxiv.org/abs/2104.10350
39. Pengfei Li et al., "Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models," arXiv:2304.03271, 2023. https://arxiv.org/abs/2304.03271
40. International Energy Agency, "Electricity 2024," analysis of data-centre and AI electricity demand. https://www.iea.org/reports/electricity-2024
41. Emily M. Bender, Timnit Gebru, et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
42. Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan, "Semantics derived automatically from language corpora contain human-like biases," Science, 2017. https://www.science.org/doi/10.1126/science.aal4230
43. Jesse Dodge et al., "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus" (C4), EMNLP 2021. https://arxiv.org/abs/2104.08758
44. Pranav Narayanan Venkit et al., "A Study of Implicit Bias in Pretrained Language Models against People with Disabilities," COLING 2022. https://aclanthology.org/2022.coling-1.113/
45. Stanford HAI, "Artificial Intelligence Index Report 2024." https://aiindex.stanford.edu/report/
46. Albert Q. Jiang et al., "Mixtral of Experts," arXiv:2401.04088, 2024. https://arxiv.org/abs/2401.04088
47. Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," arXiv:2205.14135, 2022. https://arxiv.org/abs/2205.14135
48. Zhenzhong Lan et al., "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv:1909.11942, 2019. https://arxiv.org/abs/1909.11942
49. Jean-Baptiste Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning," arXiv:2204.14198, 2022. https://arxiv.org/abs/2204.14198
50. Suchin Gururangan et al., "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks," ACL 2020. https://arxiv.org/abs/2004.10964
51. James Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," PNAS, 2017. https://www.pnas.org/doi/10.1073/pnas.1611835114
52. Ilya Sutskever, NeurIPS 2024 keynote remarks on the end of pre-training scaling ("we have but one internet"), as reported by The Verge. https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training
53. Yuntao Bai et al., "Constitutional AI: Harmlessness from AI Feedback," Anthropic, arXiv:2212.08073, 2022. https://arxiv.org/abs/2212.08073
54. Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," arXiv:2305.18290, 2023. https://arxiv.org/abs/2305.18290

## External links

- [Hugging Face Model Hub](https://huggingface.co/models) - Repository of pre-trained models

- [BERT GitHub Repository](https://github.com/google-research/bert) - Original BERT implementation

- [GPT-3 Applications](https://openai.com/research/gpt-3-apps) - Examples of GPT-3 use cases

- [TensorFlow Hub](https://www.tensorflow.org/hub) - Pre-trained model repository

- [PyTorch Hub](https://pytorch.org/hub/) - Pre-trained models for PyTorch

- [Common Crawl](https://commoncrawl.org/) - Large-scale web crawl data

- [ImageNet](https://www.image-net.org/) - Visual database for object recognition