Pre-training

Updated Jun 20, 2026

Pre-training is the first and most compute-intensive stage of building a modern AI model: a neural network is trained on a massive, mostly unlabeled dataset using self-supervised learning to learn general-purpose representations, before it is later adapted to specific tasks through fine-tuning.^[1] The pre-trained result is a versatile foundation model, which the team that coined the term defined as "any model that is trained on broad data that can be adapted to a wide range of downstream tasks."^[2] This two-stage approach has reshaped artificial intelligence across language, vision, and multimodal domains, letting models reach high accuracy with far less task-specific labeled data than training from scratch requires.^[1]

Modern foundation models like GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on billions to trillions of tokens or hundreds of millions of images, learning patterns that transfer across many downstream applications.^[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, a model pre-trained on general data can adapt to a new task with thousands or even dozens of examples.

What is pre-training in machine learning?

Pre-training is the initial training phase of a model on a broad dataset or task to learn general patterns and representations before it is fine-tuned on a specific problem.^[1] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.^[3] The objective is typically self-supervised, such as predicting the next word in a sentence or filling in masked parts of an image, so no human labels are needed and the data itself supplies the supervision signal.

History

Early foundations (2006-2012)

Pre-training's conceptual roots trace to 2006, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh published "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation, introducing greedy, layer-wise unsupervised pre-training for deep belief networks using Restricted Boltzmann Machines.^[4] This breakthrough enabled training deep networks that were previously intractable due to vanishing gradient problems: each layer was pre-trained unsupervised, then the full network was fine-tuned with supervision.^[5]

The paradigm shifted dramatically in 2012 when AlexNet won the ImageNet Large Scale Visual Recognition Challenge with a top-5 error of 15.3%, compared to 26.2% for the second-place entry, a margin of 10.9 percentage points.^[6] This established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.

Word embeddings era (2013-2018)

Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.^[7] GloVe (2014) combined global matrix factorization with local context windows.^[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.^[9]

Transformer revolution (2017-present)

The paper "Attention Is All You Need," posted to arXiv on June 12, 2017, introduced transformers, fundamentally changing AI.^[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.

BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling. As Devlin et al. wrote, "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers."^[11] OpenAI's GPT, released in June 2018, a few months before BERT, took a unidirectional, autoregressive approach instead. With GPT-3 (2020), OpenAI introduced a 175-billion-parameter model, which the paper described as "10x more than any previous non-sparse language model," demonstrating that scale plus autoregressive pre-training yields emergent capabilities like few-shot learning.^[12]

Core Concepts

The Two-Stage Paradigm: Pre-training and Fine-tuning

The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.^[13]

Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.^[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.^[5]
Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on the entire internet might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.^[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.^[14]

This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.

How does pre-training relate to transfer learning?

Pre-training is the core mechanism that enables transfer learning.^[14]^[15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.^[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.

This transfer can be implemented in two primary ways:

Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.^[16]
Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.^[16]
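
The feature-extraction option can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical: the `BACKBONE` weights stand in for a real pre-trained network, and the toy task and hyperparameters are chosen only to keep the example short. The point is that the backbone is read but never updated, while a small new head is trained on labeled data.

```python
import math
import random

# Hypothetical "pre-trained" backbone: a fixed linear map followed by tanh.
# In practice these weights would come from a real pre-training run.
BACKBONE = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]  # 3 features from 2 inputs


def extract_features(x, backbone):
    """Run the frozen backbone; its weights are read but never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in backbone]


def predict(head, feats):
    """Logistic-regression head on top of the extracted features."""
    z = head["b"] + sum(w * f for w, f in zip(head["w"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, backbone, epochs=50, lr=0.5):
    """Feature extraction: only the new head's parameters receive updates."""
    head = {"w": [0.0] * len(backbone), "b": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x, backbone)
            err = predict(head, feats) - y  # gradient of log loss w.r.t. the logit
            head["w"] = [w - lr * err * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head


# Toy labeled task: is the first coordinate larger than the second?
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
data = [(p, 1.0 if p[0] > p[1] else 0.0) for p in points]

frozen_copy = [row[:] for row in BACKBONE]
head = train_head(data, BACKBONE)
accuracy = sum(
    (predict(head, extract_features(x, BACKBONE)) > 0.5) == (y == 1.0)
    for x, y in data
) / len(data)
```

Fine-tuning would differ only in that the update step also adjusts the backbone weights, usually with a smaller learning rate than the head.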

Technical Approach

Core Mechanics

Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:

Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
Training process:

Forward propagation through deep neural networks (typically transformers with 12-96 layers)
Calculating loss against self-supervised objectives
Backpropagating gradients
Updating billions of parameters using optimizers like Adam or AdamW

Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.^[17]
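
The four training steps can be sketched at toy scale. This is an illustration under loud assumptions, not how production systems are built: the "model" is a bigram table of next-token logits rather than a transformer, the corpus is a single made-up sentence, and the optimizer is plain SGD rather than Adam or AdamW.

```python
import math

text = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(text))
index = {tok: i for i, tok in enumerate(vocab)}
ids = [index[t] for t in text]
V = len(vocab)

# Parameters: one row of next-token logits per previous token (a bigram model).
W = [[0.0] * V for _ in range(V)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(W, ids, lr=0.5):
    """One pass: forward, loss, gradient, update for each (prev, next) pair."""
    total_loss = 0.0
    for prev, nxt in zip(ids, ids[1:]):
        probs = softmax(W[prev])                 # forward pass
        total_loss += -math.log(probs[nxt])      # self-supervised cross-entropy loss
        for j in range(V):
            # Gradient of the loss w.r.t. logit j is softmax(j) - one_hot(j).
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            W[prev][j] -= lr * grad              # plain SGD parameter update
    return total_loss / (len(ids) - 1)


losses = [train_epoch(W, ids) for _ in range(50)]
the_row = W[index["the"]]
most_likely_after_the = vocab[the_row.index(max(the_row))]
```

The labels come from the text itself (each token is the target for the one before it), which is what makes the objective self-supervised.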

Pre-training Objectives

Masked Language Modeling (MLM)

Popularized by BERT, MLM randomly masks approximately 15% of input tokens and trains models to predict them using full bidirectional context.^[11] The masking strategy:

80% of selected tokens become [MASK]
10% swap to random tokens
10% remain unchanged
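
A minimal sketch of the 80/10/10 corruption rule is shown here. The function name, the toy vocabulary, and the use of an independent 15% coin flip per token are assumptions made for illustration; real implementations typically work on token IDs, skip special tokens, and may select a fixed number of positions per sequence.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where no prediction is made."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)      # not selected: left alone, no loss here
            labels.append(None)
            continue
        labels.append(tok)             # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                 # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))    # 10%: replace with a random token
        else:
            corrupted.append(tok)                  # 10%: keep the original token
    return corrupted, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
rng = random.Random(0)
tokens = [rng.choice(vocab) for _ in range(20000)]
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked_share = sum(corrupted[i] == MASK for i in selected) / len(selected)
```

The loss is computed only at selected positions, including those left unchanged, so the model cannot assume that an unmasked token is always correct.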

Variants include:

RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data^[18]
SpanBERT: Masks contiguous spans rather than individual tokens^[19]
ELECTRA: Discriminative pre-training detecting replaced tokens^[20]

Autoregressive Language Modeling

The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.^[21] Training maximizes the likelihood of each sequence, factorized by the chain rule: P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁)
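
In practice the product is computed as a sum of log-probabilities, and training minimizes the average negative log-likelihood per token. The sketch below assumes a hypothetical lookup table `COND` in place of a neural network; a real autoregressive model conditions on the whole prefix rather than only the last token.

```python
import math

# Hypothetical conditional model: P(next | previous token) from a lookup table.
COND = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.5, "ran": 0.5},
    "dog": {"ran": 1.0},
}


def cond_prob(prefix, token):
    """P(token | prefix); this toy version only looks at the last prefix token."""
    return COND.get(prefix[-1], {}).get(token, 0.0)


def sequence_log_prob(tokens):
    """log P(x_1..x_n) = sum over i of log P(x_i | x_1..x_{i-1})."""
    prefix = ["<s>"]
    total = 0.0
    for tok in tokens:
        total += math.log(cond_prob(prefix, tok))
        prefix.append(tok)
    return total


sentence = ["the", "cat", "sat"]
log_p = sequence_log_prob(sentence)
nll_per_token = -log_p / len(sentence)   # the quantity minimized during pre-training
```

For this sentence the probability is the product 1.0 × 0.6 × 0.5, so `math.exp(log_p)` recovers 0.3 up to floating-point error.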

Contrastive Learning

Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:

SimCLR: Learns representations maximizing agreement between augmented views of same image^[22]
CLIP: Jointly trains image and text encoders on 400 million image-text pairs^[23]
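MAE: Not a contrastive method but a masked-autoencoding (reconstruction) approach, listed here as the main self-supervised alternative in vision; masks 75% of image patches and reconstructs missing pixels^[24]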
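
The contrastive objective used by methods such as SimCLR and CLIP can be sketched as an InfoNCE loss over a batch of paired embeddings. The version below is a simplified illustration: the embeddings are hand-written vectors, the temperature of 0.1 is an assumed value (CLIP learns its temperature), and the function names are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct column is i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        loss += log_sum - row[i]
    return loss / len(logits)


def info_nce(a_embs, b_embs, temperature=0.1):
    """Symmetric InfoNCE: (a_i, b_i) are positives, all other pairings are negatives."""
    a = [normalize(v) for v in a_embs]
    b = [normalize(v) for v in b_embs]
    # Cosine similarity of every a_i with every b_j, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Two "views" per item, for example two augmentations of an image or an image and its caption.
views_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
views_b = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = info_nce(views_a, views_b)
shuffled = info_nce(views_a, views_b[1:] + views_b[:1])
```

The loss is low when matching pairs are the most similar entries in their row and column, and rises when the pairing is shuffled, which is the signal that pulls positives together and pushes negatives apart.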

Feature	Generative Pre-training	Contrastive Pre-training
Core Objective	Reconstruct or predict parts of the input data; model the data distribution	Learn an embedding space where similar samples are close and dissimilar samples are far apart
Supervision Signal	The input data itself (for example the original unmasked token, the complete image)	The relationship between pairs of data points (positive vs. negative pairs)
Typical Architectures	Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers)	Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo)
Example Pretext Tasks	Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising	Identifying an augmented version of an image from a batch of other images
Strengths	Can generate new data; learns a rich, dense representation of the data distribution	Often learns representations highly effective for downstream classification tasks; can be more sample-efficient
Weaknesses	Can be computationally expensive; may have inferior data scaling capacity	Can be data-hungry and prone to over-fitting on limited data; sensitive to negative sample choice

Datasets

Language Datasets

Dataset	Size	Description	Used by
Common Crawl	320+ TB raw	Web crawl data	Most modern LLMs
C4	750 GB	Cleaned Common Crawl	T5, many others
The Pile	825 GB	22 diverse sources	GPT-Neo, GPT-J
RefinedWeb	5+ trillion tokens	Filtered Common Crawl	Falcon
RedPajama	1.2 trillion tokens	Open reproduction of LLaMA data	Open models
BookCorpus	800M words	11,000+ books	BERT, GPT
Wikipedia	2.5B words (English)	Encyclopedia articles	BERT, GPT, most LLMs

Vision Datasets

Dataset	Size	Description	Primary use
ImageNet (ILSVRC subset)	1.2M images	1000 object classes; the full collection has 14+ million images	Supervised pre-training
JFT-300M	300M images	Google internal dataset	Large-scale pre-training
LAION-5B	5.85B pairs	Image-text pairs from web	CLIP-style training
DataComp	12.8B pairs	CommonPool for research	Multimodal research
COCO	330K images	Object detection/segmentation	Vision tasks

How much does pre-training cost?

Pre-training costs span several orders of magnitude, from a few hundred dollars for a small encoder to tens of millions of dollars for the largest frontier models. Lambda Labs estimated that pre-training GPT-3 required roughly 3.14 × 10^23 floating-point operations, which at $1.50 per GPU-hour on a V100 server would cost about $4.6 million.^[12] Meta reported that pre-training Llama 3.1 405B on more than 15 trillion tokens took 30.84 million GPU-hours on a cluster of 16,384 NVIDIA H100 80GB GPUs.^[25]
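
The figures quoted above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch relies on assumptions that are not stated in the text: the common rule of thumb of about 6 FLOPs per parameter per training token, the 300-billion-token training run reported in the GPT-3 paper, and an assumed sustained V100 throughput of 28 TFLOPS. The function names are illustrative.

```python
def training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


def gpu_hours(flops, flops_per_second):
    """Hours one device needs to execute `flops` at a sustained rate."""
    return flops / flops_per_second / 3600


# GPT-3: 175B parameters; the 300B-token count is taken from the GPT-3 paper.
gpt3_flops = training_flops(175e9, 300e9)

# Assumed sustained V100 throughput of 28 TFLOPS (an assumption, not a measurement).
gpt3_hours = gpu_hours(gpt3_flops, 28e12)
gpt3_cost = gpt3_hours * 1.5          # $1.50 per GPU-hour, as quoted above

# Llama 3.1 405B: reported GPU-hours spread evenly over the reported cluster size.
llama_days = 30.84e6 / 16_384 / 24    # wall-clock days if every GPU ran the whole time

print(f"GPT-3 FLOPs: {gpt3_flops:.2e}")
print(f"GPT-3 cost:  ${gpt3_cost / 1e6:.2f} million")
print(f"Llama 3.1:   {llama_days:.1f} days on 16,384 GPUs")
```

Under these assumptions the GPT-3 estimate lands within a few percent of the quoted $4.6 million, and the Llama 3.1 figure implies roughly two and a half months of wall-clock time on the full cluster.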

Training Costs

Model	Parameters	Training time	Hardware	Estimated cost
BERT-Base	110M	4 days	16 TPUs	$500-1,000
GPT-3	175B	~34 days (estimated)	1024 A100s (hypothetical; the original run used V100s)	$4.6 million (V100 cloud-price estimate)
Llama 2 (7B)	7B	1-2 weeks	64-128 A100s	$200,000-500,000
Llama 3.1 (405B)	405B	30.84M GPU-hours	16,384 H100s	$10-20 million
T5	11B	4 weeks	256 TPU v3	$1.5 million
CLIP	400M	12 days (ViT-L/14) to 18 days (RN50x64)	256 V100s (ViT-L/14) to 592 V100s (RN50x64)	$600,000

Hardware Evolution

NVIDIA A100 (2020): 312 TFLOPS of dense FP16/BF16 Tensor Core throughput, 40/80GB memory, workhorse for 2020-2023 training^[26]
NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models^[27]
Google TPU v5e (2023): 256-chip pods; Google reported a multi-pod training job spanning 50,944 chips with about 10 exaFLOPS of peak 16-bit compute^[28]

Notable Pre-Trained Models

Model	Domain	Release Year	Parameters	Key Objective	Developer
Word2Vec	NLP	2013	N/A (300-dim vectors)	Skip-gram/CBOW	Google
ResNet-50	CV	2015	25M	Image Classification	Microsoft
BERT	NLP	2018	340M	MLM, NSP	Google
RoBERTa	NLP	2019	355M	Dynamic MLM	Facebook
T5	NLP	2019	11B	Text-to-Text	Google
XLNet	NLP	2019	340M	Permutation LM	Google/CMU
GPT-3	NLP	2020	175B	Autoregressive LM	OpenAI
ELECTRA	NLP	2020	340M	Replaced Token Detection	Google
Vision Transformer	CV	2020	86M-632M	Image Classification	Google
CLIP	Multimodal	2021	400M	Contrastive Alignment	OpenAI
DALL-E	Multimodal	2021	12B	Text-to-Image	OpenAI
Flamingo	Multimodal	2022	80B	Visual Language	DeepMind
Llama 2	NLP	2023	7B-70B	Autoregressive LM	Meta

Applications

Natural Language Processing

Pre-trained language models power nearly all modern NLP applications:^[14]^[29]

Question answering: Models achieve 89.91% F1 on SQuAD 2.0, approaching human performance^[30]
Code generation: Microsoft CEO Satya Nadella stated in March 2023 that GitHub Copilot was writing 46% of code in files where it is enabled^[31]
Conversational AI: Powering chatbots and virtual assistants like ChatGPT
Machine translation: Near-human quality on many language pairs
Sentiment analysis: 90%+ accuracy for review and social media monitoring
Text summarization: Condensing documents while preserving key information

Computer Vision

Pre-trained vision models serve as backbones for diverse applications:^[32]

Image classification: Vision Transformers achieve 88.5%+ ImageNet accuracy^[33]
Object detection: 50+ box AP on COCO
Medical imaging: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs
Autonomous vehicles: Real-time object detection for pedestrians, vehicles, traffic signs
Industrial automation: Quality control, safety monitoring, defect detection

Speech and Multimodal Systems

Speech Recognition: Pre-training builds robust speech-to-text systems less sensitive to accents and noise^[34]
Text-to-image generation: Stable Diffusion uses CLIP embeddings for image synthesis^[35]
Visual question answering: Flamingo achieves state-of-the-art on 16 benchmarks^[36]
Zero-shot classification: CLIP matches supervised models without task-specific training

Benefits and Advantages

Pre-training offers several critical advantages:^[37]

Resource efficiency: Reduces labeled data requirements by 10-100x
Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
Better performance: Pre-trained models consistently outperform random initialization
Transfer learning: Knowledge transfers across related tasks and domains
Democratization: Smaller teams can leverage frontier model capabilities
Generalization: Models learn robust features that work across diverse applications

Challenges and Limitations

Environmental Impact

The computational demands of pre-training create significant environmental costs:

Carbon emissions: GPT-3 training produced an estimated 552 metric tons of CO2 (about 1.2 million pounds), per a 2021 Google and UC Berkeley analysis^[38]
Water consumption: Estimated 700,000 liters for cooling during GPT-3 training^[39]
Energy use: ChatGPT queries have been estimated to use roughly 10x more energy than a Google search^[40]

Bias, Fairness, and Ethical Considerations

Pre-trained models inherit and amplify societal biases present in training data:^[41]

Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)^[42]
Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations^[42]
Linguistic bias: C4's blocklist filtering removes text associated with African American English at a rate of 42%, versus 6.2% for White-aligned American English^[43]
Disability bias: Perpetuation of negative stereotypes about people with disabilities^[44]

Data and Legal Issues

Copyright concerns: Active litigation regarding training on copyrighted content
Privacy violations: Models may memorize and reproduce personal information
Data contamination: Benchmarks appearing in training data inflate scores
Data quality: Web-scraped data contains misinformation, toxicity, and bias
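
One common heuristic for detecting data contamination is to look for long n-gram overlaps between benchmark items and the training corpus. The sketch below is a simplified illustration: the function names, the whitespace tokenization, and the default n-gram length are assumptions, and real pipelines normalize text more carefully and operate on far larger corpora.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Share of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged


# Made-up data: the first benchmark question appears verbatim in a scraped page.
corpus = ["forum post: what is the capital of france ? paris is the capital of france obviously"]
benchmark = ["What is the capital of France ?", "Which river flows through Vienna ?"]

rate, flagged = contamination_rate(benchmark, corpus, n=4)
```

Flagged items can then be removed from the training data or reported separately, so that benchmark scores reflect generalization rather than memorization.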

Accessibility and Centralization

Geographic concentration: The 2024 Stanford AI Index reported that U.S.-based institutions produced 61 notable AI models in 2023, far more than the European Union's 21 and China's 15^[45]
Compute barriers: Training frontier models requires $10-100M+ in resources
Hardware costs: A single H100 GPU has carried list and street prices in the $25,000-40,000 range
Technical expertise: Requires specialized knowledge in distributed systems and optimization

Future Directions

Efficiency Improvements

Mixture of Experts: Mixtral 8x7B matches or outperforms Llama 2 70B on most reported benchmarks while activating about 13B of its 47B parameters per token^[46]
Knowledge distillation: Creating smaller models matching larger model performance
Quantization: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss
Flash Attention: 2-4x speedup through optimized attention computation^[47]
ALBERT: Parameter sharing reduces model size by 18x^[48]
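
As one concrete case from the list above, quantization can be sketched as symmetric per-tensor INT8 rounding. This is a minimal illustration with made-up weights and hypothetical function names; production schemes usually quantize per channel or per group and calibrate activations as well.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # The largest magnitude maps to 127; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte plus a shared scale, and the round-trip error per weight is bounded by half the scale, which is why accuracy loss is often small when weight magnitudes are evenly spread.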

Multimodal and Lifelong Learning

Multimodal pre-training: Models processing text, images, audio, and video seamlessly^[49]
Continual pre-training: Updating models with new data without full retraining^[50]
Domain adaptation: Specializing models for medicine, law, science
Lifelong learning: Overcoming catastrophic forgetting to learn continuously^[51]

Beyond Pre-training: Peak Data and Alignment

Some researchers suggest the era of scaling pre-training datasets is ending:^[52]

Peak data hypothesis: Exhausting high-quality public data sources
Post-training focus: Greater emphasis on alignment techniques like RLHF
Constitutional AI: Self-improvement guided by explicit principles^[53]
Direct Preference Optimization: More efficient alternatives to RLHF^[54]
Agentic AI: Systems learning from environment interaction rather than static datasets

References

[1] IBM, "What is pre-training?" IBM Think Topics. https://www.ibm.com/think/topics/pretraining
[2] Rishi Bommasani et al., "On the Opportunities and Risks of Foundation Models," Center for Research on Foundation Models (CRFM), Stanford University, 2021. https://arxiv.org/abs/2108.07258
[3] Stanford HAI, "Reflections on Foundation Models," Stanford Institute for Human-Centered AI. https://hai.stanford.edu/news/reflections-foundation-models
[4] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, July 2006. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
[5] Google Cloud, "What is a foundation model?" https://cloud.google.com/discover/what-is-a-foundation-model
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25 (NeurIPS 2012). https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[7] Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
[8] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "GloVe: Global Vectors for Word Representation," EMNLP 2014. https://nlp.stanford.edu/projects/glove/
[9] Matthew E. Peters et al., "Deep Contextualized Word Representations" (ELMo), NAACL 2018. https://arxiv.org/abs/1802.05365
[10] Ashish Vaswani et al., "Attention Is All You Need," arXiv:1706.03762, June 12, 2017. https://arxiv.org/abs/1706.03762
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, October 2018. https://arxiv.org/abs/1810.04805
[12] Tom B. Brown et al., "Language Models are Few-Shot Learners" (GPT-3), arXiv:2005.14165, 2020. https://arxiv.org/abs/2005.14165; cost estimate from Lambda Labs, "OpenAI's GPT-3 Language Model: A Technical Overview." https://lambda.ai/blog/demystifying-gpt-3
[13] Google Cloud, "What is generative AI?" https://cloud.google.com/use-cases/generative-ai
[14] Sebastian Ruder, "Transfer Learning - Machine Learning's Next Frontier," 2017. https://www.ruder.io/transfer-learning/
[15] IBM, "What is transfer learning?" https://www.ibm.com/think/topics/transfer-learning
[16] PyTorch, "Transfer Learning for Computer Vision Tutorial." https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
[17] Jared Kaplan et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020. https://arxiv.org/abs/2001.08361
[18] Yinhan Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv:1907.11692, 2019. https://arxiv.org/abs/1907.11692
[19] Mandar Joshi et al., "SpanBERT: Improving Pre-training by Representing and Predicting Spans," arXiv:1907.10529, 2019. https://arxiv.org/abs/1907.10529
[20] Kevin Clark et al., "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators," ICLR 2020. https://arxiv.org/abs/2003.10555
[21] Alec Radford et al., "Improving Language Understanding by Generative Pre-Training" (GPT), OpenAI, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[22] Ting Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR), arXiv:2002.05709, 2020. https://arxiv.org/abs/2002.05709
[23] Alec Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP), arXiv:2103.00020, 2021. https://arxiv.org/abs/2103.00020
[24] Kaiming He et al., "Masked Autoencoders Are Scalable Vision Learners" (MAE), arXiv:2111.06377, 2021. https://arxiv.org/abs/2111.06377
[25] Meta AI, "Introducing Llama 3.1: Our most capable models to date," July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
[26] NVIDIA, "NVIDIA A100 Tensor Core GPU" datasheet. https://www.nvidia.com/en-us/data-center/a100/
[27] NVIDIA, "NVIDIA H100 Tensor Core GPU" datasheet. https://www.nvidia.com/en-us/data-center/h100/
[28] Google Cloud, "Cloud TPU v5e" documentation. https://cloud.google.com/tpu/docs/v5e
[29] Pengfei Liu et al., "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing," arXiv:2107.13586, 2021. https://arxiv.org/abs/2107.13586
[30] Pranav Rajpurkar, Robin Jia, and Percy Liang, "Know What You Don't Know: Unanswerable Questions for SQuAD" (SQuAD 2.0) and leaderboard. https://rajpurkar.github.io/SQuAD-explorer/
[31] Satya Nadella, GitHub Copilot X announcement remarks, March 2023, as reported by GitHub/Microsoft. https://github.blog/news-insights/product-news/github-copilot-x-the-ai-powered-developer-experience/
[32] Kaiming He et al., "Deep Residual Learning for Image Recognition" (ResNet), CVPR 2016. https://arxiv.org/abs/1512.03385
[33] Alexey Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT), ICLR 2021. https://arxiv.org/abs/2010.11929
[34] Alexei Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv:2006.11477, 2020. https://arxiv.org/abs/2006.11477
[35] Robin Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion), CVPR 2022. https://arxiv.org/abs/2112.10752
[36] Jean-Baptiste Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning," arXiv:2204.14198, 2022. https://arxiv.org/abs/2204.14198
[37] IBM, "What is fine-tuning?" https://www.ibm.com/think/topics/fine-tuning
[38] David Patterson et al., "Carbon Emissions and Large Neural Network Training," arXiv:2104.10350, 2021. https://arxiv.org/abs/2104.10350
[39] Pengfei Li et al., "Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models," arXiv:2304.03271, 2023. https://arxiv.org/abs/2304.03271
[40] International Energy Agency, "Electricity 2024," analysis of data-centre and AI electricity demand. https://www.iea.org/reports/electricity-2024
[41] Emily M. Bender, Timnit Gebru, et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
[42] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan, "Semantics derived automatically from language corpora contain human-like biases," Science, 2017. https://www.science.org/doi/10.1126/science.aal4230
[43] Jesse Dodge et al., "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus" (C4), EMNLP 2021. https://arxiv.org/abs/2104.08758
[44] Pranav Narayanan Venkit et al., "A Study of Implicit Bias in Pretrained Language Models against People with Disabilities," COLING 2022. https://aclanthology.org/2022.coling-1.113/
[45] Stanford HAI, "Artificial Intelligence Index Report 2024." https://aiindex.stanford.edu/report/
[46] Albert Q. Jiang et al., "Mixtral of Experts," arXiv:2401.04088, 2024. https://arxiv.org/abs/2401.04088
[47] Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," arXiv:2205.14135, 2022. https://arxiv.org/abs/2205.14135
[48] Zhenzhong Lan et al., "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv:1909.11942, 2019. https://arxiv.org/abs/1909.11942
[49] Jean-Baptiste Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning," arXiv:2204.14198, 2022. https://arxiv.org/abs/2204.14198
[50] Suchin Gururangan et al., "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks," ACL 2020. https://arxiv.org/abs/2004.10964
[51] James Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," PNAS, 2017. https://www.pnas.org/doi/10.1073/pnas.1611835114
[52] Ilya Sutskever, NeurIPS 2024 keynote remarks on the end of pre-training scaling ("we have but one internet"), as reported by The Verge. https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training
[53] Yuntao Bai et al., "Constitutional AI: Harmlessness from AI Feedback," Anthropic, arXiv:2212.08073, 2022. https://arxiv.org/abs/2212.08073
[54] Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," arXiv:2305.18290, 2023. https://arxiv.org/abs/2305.18290

External links

Hugging Face Model Hub - Repository of pre-trained models
BERT GitHub Repository - Original BERT implementation
GPT-3 Applications - Examples of GPT-3 use cases
TensorFlow Hub - Pre-trained model repository
PyTorch Hub - Pre-trained models for PyTorch
Common Crawl - Large-scale web crawl data
ImageNet - Visual database for object recognition
