Pre-training
Pre-training is a foundational machine learning paradigm where neural network models first learn general representations from massive unlabeled datasets before being adapted to specific downstream tasks through fine-tuning. This two-stage approach has revolutionized artificial intelligence across domains, enabling models to achieve superior performance with dramatically less task-specific labeled data than training from scratch.[1]
More formally, pre-training is the initial training phase of a model on a broad dataset or task, in which it learns general patterns and representations before being fine-tuned on a specific problem.[2] In this stage, a model (often called a foundation model when it is general-purpose) is trained on large-scale data, frequently using unlabeled data and self-supervised learning objectives, to acquire a broad understanding of features or knowledge.[3]
Modern foundation models such as GPT-4, BERT, and CLIP derive their capabilities primarily from pre-training on corpora ranging from billions of words to trillions of tokens, or on billions of images, learning rich patterns that transfer effectively across a wide range of downstream applications.[3] Pre-training addresses the fundamental challenge of data scarcity: rather than requiring millions of labeled examples for each task, models pre-trained on general data can adapt to new tasks with mere thousands or even dozens of examples.
History
Early foundations (2006-2012)
Pre-training's conceptual roots trace to 2006-2007, when Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduced unsupervised layer-wise pre-training for deep belief networks using Restricted Boltzmann Machines.[4] This breakthrough made it possible to train deep networks that had previously been intractable due to vanishing gradients: each layer was pre-trained without supervision, and the full network was then fine-tuned with supervision.[5]
The paradigm shifted dramatically in 2012 when AlexNet won ImageNet with a top-5 error of 15.3%, compared with 26.2% for the runner-up, a margin of 10.9 percentage points.[6] This established supervised large-scale pre-training as the dominant transfer learning approach for computer vision.
Word embeddings era (2013-2017)
Word embeddings emerged as natural language processing's first scalable pre-training method. Word2Vec (2013) introduced CBOW and Skip-gram architectures that learned dense vector representations capturing semantic relationships.[7] GloVe (2014) combined global matrix factorization with local context windows.[8] These static embeddings gave way to contextualized representations with ELMo (2018), which used bidirectional LSTMs to generate different embeddings for the same word based on context.[9]
Transformer revolution (2017-present)
The June 2017 paper "Attention Is All You Need" introduced transformers, fundamentally changing AI.[10] The architecture eliminated recurrence and convolution, relying entirely on multi-head self-attention mechanisms that compute relationships between all positions in parallel.
BERT's October 2018 release demonstrated transformers' power for language understanding through bidirectional pre-training using masked language modeling.[11] The GPT series took the complementary unidirectional approach, with GPT-3 (2020) demonstrating that scale combined with autoregressive pre-training yields emergent capabilities such as few-shot learning.[12]
Core Concepts
The Two-Stage Paradigm: Pre-training and Fine-tuning
The development of large-scale AI models is now dominated by a two-stage paradigm that separates general knowledge acquisition from task-specific adaptation.[13]
- Stage 1: Pre-training: In this initial, computationally intensive phase, a model is trained on a massive, often unlabeled, dataset. The objective is typically self-supervised, such as predicting the next word in a sentence or filling in missing parts of an image.[13] This stage is where the model learns fundamental concepts, from the grammar and syntax of language to the textures and shapes of visual objects. The result of this stage is a pre-trained model, which serves as a versatile foundation.[5]
- Stage 2: Fine-tuning: The pre-trained model is then adapted for a specific application by continuing the training process on a much smaller, task-specific, and typically labeled dataset. For example, a language model pre-trained on the entire internet might be fine-tuned on a dataset of customer reviews to perform sentiment analysis.[11] This step adjusts the model's pre-existing parameters to specialize its knowledge for the target task.[14]
This two-stage approach represents a fundamental philosophical shift in machine learning, moving the field away from building highly specialized, single-task models from scratch toward creating generalist, reusable foundation models.
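The separation of the two stages can be made concrete with a small sketch. The following is a minimal, hypothetical illustration (a toy recurrent language model trained on random token batches, not any production recipe): the same network is first pre-trained with a self-supervised next-token objective on unlabeled sequences, then fine-tuned with a supervised classification head on a small labeled set.

```python
# Minimal sketch of the two-stage paradigm (hypothetical toy model, random stand-in data).
import torch
import torch.nn as nn

vocab_size, d_model, num_classes = 100, 32, 2

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)    # used during pre-training
        self.cls_head = nn.Linear(d_model, num_classes)  # used during fine-tuning

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embed(tokens))
        return hidden

model = TinyLM()
loss_fn = nn.CrossEntropyLoss()

# Stage 1: self-supervised pre-training -- predict the next token of unlabeled text.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for _ in range(100):
    tokens = torch.randint(0, vocab_size, (8, 16))       # stand-in for unlabeled text
    hidden = model(tokens)
    logits = model.lm_head(hidden[:, :-1])               # position t predicts token t+1
    loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: supervised fine-tuning -- reuse the pre-trained encoder for a labeled task.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)     # smaller learning rate for adaptation
for _ in range(20):
    tokens = torch.randint(0, vocab_size, (8, 16))       # stand-in for labeled examples
    labels = torch.randint(0, num_classes, (8,))
    hidden = model(tokens)
    logits = model.cls_head(hidden[:, -1])               # pool the final hidden state
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```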
Relationship with Transfer Learning
Pre-training is the core mechanism that enables transfer learning.[14][15] Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second, related task.[15] The pre-trained model is the tangible artifact that stores the knowledge to be transferred.
This transfer can be implemented in two primary ways, both sketched in code after the list:
- Feature Extraction: The pre-trained model is used as a fixed feature extractor with parameters frozen, and only new task-specific layers are trained.[16]
- Fine-tuning: Some or all of the pre-trained model's parameters are updated during training on the new dataset, allowing deeper adaptation to the new task.[16]
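In code, the difference between the two strategies comes down to which parameters the optimizer is allowed to update. The sketch below is illustrative only, assuming a recent version of torchvision (0.13 or later) and its pre-trained ResNet-50 weights as the backbone; the class count and learning rates are placeholders.

```python
# Feature extraction vs. fine-tuning with a pre-trained ResNet-50 backbone (illustrative sketch).
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical downstream task

# Option A: feature extraction -- freeze the pre-trained backbone, train only a new head.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False                                        # keep pre-trained weights fixed
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)      # new task-specific layer (trainable)
optimizer_a = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)

# Option B: fine-tuning -- update all pre-trained weights at a small learning rate.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer_b = torch.optim.AdamW(model.parameters(), lr=1e-5)       # small LR preserves learned features
```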
Technical Approach
Core Mechanics
Pre-training operates through self-supervised learning objectives that enable models to extract knowledge from unlabeled data at massive scale. The process involves:
- Data collection: Gathering datasets from sources like Common Crawl (15+ trillion tokens) or ImageNet (14+ million images)
- Pretext task design: Creating self-supervised objectives like predicting masked words, forecasting next tokens, or matching image-text pairs
- Training process:
- Forward propagation through deep neural networks (typically transformers with 12-96 layers)
- Calculating loss against self-supervised objectives
- Backpropagating gradients
- Updating billions of parameters using optimizers like Adam or AdamW
Modern pre-training runs for weeks to months on thousands of GPUs or TPUs, processing datasets measured in terabytes.[17]
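These steps translate directly into a standard training loop. The sketch below uses a hypothetical two-layer toy transformer and random token batches in place of a real tokenized corpus; large-scale runs follow the same structure but add data pipelines, model and data parallelism, mixed precision, and learning-rate schedules.

```python
# Schematic pre-training loop: forward pass, self-supervised loss, backpropagation, AdamW update.
import torch
import torch.nn as nn

vocab, d_model, seq_len = 1000, 64, 32

class ToyTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))  # left-to-right attention
        return self.head(self.blocks(self.embed(x), mask=causal))

model = ToyTransformerLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    batch = torch.randint(0, vocab, (16, seq_len))            # stand-in for tokenized web text
    logits = model(batch[:, :-1])                             # forward propagation
    loss = loss_fn(logits.reshape(-1, vocab),
                   batch[:, 1:].reshape(-1))                  # loss against the next-token objective
    loss.backward()                                           # backpropagate gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # common stabilization step
    optimizer.step()                                          # update parameters with AdamW
    optimizer.zero_grad()
```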
Pre-training Objectives
Masked Language Modeling (MLM)
Popularized by BERT, MLM randomly masks approximately 15% of input tokens and trains models to predict them using full bidirectional context.[11] The masking strategy:
- 80% of selected tokens become [MASK]
- 10% swap to random tokens
- 10% remain unchanged
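A minimal sketch of this selection-and-replacement scheme is shown below (simplified from the data collators used in practice; the token IDs and the `[MASK]` ID are hypothetical, and unselected positions are marked with -100 so the loss ignores them):

```python
# BERT-style masking: select ~15% of tokens, then apply the 80/10/10 replacement rule.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob            # choose ~15% of positions
    labels[~selected] = -100                                     # unselected positions are ignored by the loss

    masked = input_ids.clone()
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)     # 80% of selected -> [MASK]
    masked[to_mask] = mask_token_id
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    masked[to_random] = torch.randint(0, vocab_size, input_ids.shape)[to_random]
    # The remaining 10% of selected tokens are left unchanged.
    return masked, labels

inputs = torch.randint(5, 1000, (2, 16))     # hypothetical token IDs (0-4 reserved for special tokens)
masked_inputs, labels = mask_tokens(inputs, mask_token_id=4, vocab_size=1000)
```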
Variants include:
- RoBERTa: Removed next sentence prediction, used dynamic masking, trained on 10x more data[18]
- SpanBERT: Masks contiguous spans rather than individual tokens[19]
- ELECTRA: Discriminative pre-training detecting replaced tokens[20]
Autoregressive Language Modeling
The GPT family employs autoregressive pre-training, predicting each token given all previous tokens in left-to-right fashion.[21] The objective maximizes the likelihood P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, ..., xᵢ₋₁).
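As a worked instance of this factorization, the joint probability of a short sequence is the product of the model's per-step conditionals, and training minimizes the corresponding negative log-likelihood (the probabilities below are made-up numbers for illustration):

```python
# Worked example of the autoregressive factorization for a three-token sequence.
import torch

p_x1 = torch.tensor(0.20)               # P(x1)
p_x2_given_x1 = torch.tensor(0.50)      # P(x2 | x1)
p_x3_given_x12 = torch.tensor(0.10)     # P(x3 | x1, x2)

joint = p_x1 * p_x2_given_x1 * p_x3_given_x12                     # P(x1, x2, x3) = 0.01
nll = -(p_x1.log() + p_x2_given_x1.log() + p_x3_given_x12.log())  # negative log-likelihood ~= 4.605
# GPT-style training minimizes this quantity, summed over every position in the corpus,
# which is equivalent to the cross-entropy loss on next-token predictions.
```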
Contrastive Learning
Contrastive learning revolutionized self-supervised pre-training in computer vision and multimodal domains:
- SimCLR: Learns representations maximizing agreement between augmented views of same image[22]
- CLIP: Jointly trains image and text encoders on 400 million image-text pairs[23]
- MAE: Masked autoencoding that hides 75% of image patches and reconstructs the missing pixels (a generative rather than strictly contrastive objective)[24]
| Feature | Generative Pre-training | Contrastive Pre-training |
|---|---|---|
| Core Objective | Reconstruct or predict parts of the input data; model the data distribution | Learn an embedding space where similar samples are close and dissimilar samples are far apart |
| Supervision Signal | The input data itself (for example the original unmasked token, the complete image) | The relationship between pairs of data points (positive vs. negative pairs) |
| Typical Architectures | Autoencoders (AEs, VAEs), GANs, Autoregressive Models (for example Transformers) | Siamese Networks, models using InfoNCE or Triplet Loss objectives (for example SimCLR, MoCo) |
| Example Pretext Tasks | Masked Language Modeling (BERT), Next-Token Prediction (GPT), Image Inpainting, Denoising | Identifying an augmented version of an image from a batch of other images |
| Strengths | Can generate new data; learns a rich, dense representation of the data distribution | Often learns representations highly effective for downstream classification tasks; can be more sample-efficient |
| Weaknesses | Can be computationally expensive; may scale less favorably with additional data | Can be data-hungry and prone to over-fitting on limited data; sensitive to the choice of negative samples |
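Most contrastive objectives in the table reduce to an InfoNCE-style loss over an embedding space. The sketch below shows the symmetric, CLIP-style form for a batch of paired image and text embeddings; random tensors stand in for real encoder outputs, and the temperature value is illustrative.

```python
# Symmetric InfoNCE loss over a batch of paired embeddings (CLIP-style sketch).
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 64, 0.07
img = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in for image-encoder outputs
txt = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in for text-encoder outputs

logits = img @ txt.T / temperature                   # pairwise cosine similarities, scaled
targets = torch.arange(batch)                        # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, targets) +           # image-to-text direction
        F.cross_entropy(logits.T, targets)) / 2      # text-to-image direction
```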
Datasets
Language Datasets
| Dataset | Size | Description | Used by |
|---|---|---|---|
| Common Crawl | 320+ TB raw | Web crawl data | Most modern LLMs |
| C4 | 750 GB | Cleaned Common Crawl | T5, many others |
| The Pile | 825 GB | 22 diverse sources | GPT-Neo, GPT-J |
| RefinedWeb | 5+ trillion tokens | Filtered Common Crawl | Falcon |
| RedPajama | 1.2 trillion tokens | Open reproduction of LLaMA data | Open models |
| BookCorpus | 800M words | 11,000+ books | BERT, GPT |
| Wikipedia | 2.5B words (English) | Encyclopedia articles | BERT, GPT, most LLMs |
Vision Datasets
| Dataset | Size | Description | Primary use |
|---|---|---|---|
| ImageNet | 1.2M images | 1000 object classes | Supervised pre-training |
| JFT-300M | 300M images | Google internal dataset | Large-scale pre-training |
| LAION-5B | 5.85B pairs | Image-text pairs from web | CLIP-style training |
| DataComp | 12.8B pairs | CommonPool for research | Multimodal research |
| COCO | 330K images | Object detection/segmentation | Vision tasks |
Computational Requirements
Training Costs
| Model | Parameters | Training time | Hardware | Estimated cost |
|---|---|---|---|---|
| BERT-Base | 110M | 4 days | 16 TPUs | $500-1,000 |
| GPT-3 | 175B | ~34 days | 1024 A100s | $4.6 million |
| Llama 2 (7B) | 7B | 1-2 weeks | 64-128 A100s | $200,000-500,000 |
| Llama 3.1 (405B) | 405B | 30.84M GPU-hours | 24,576 H100s | $10-20 million |
| T5 | 11B | 4 weeks | 256 TPU v3 | $1.5 million |
| CLIP | 400M | 12 days | 592 V100s | $600,000 |
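These figures can be roughly cross-checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below applies it to the GPT-3 row, assuming the roughly 300 billion training tokens reported for GPT-3 and an assumed 35% utilization of peak A100 throughput; both assumptions are illustrative rather than reported values.

```python
# Back-of-envelope training-compute estimate using the ~6 * N * D FLOPs rule of thumb.
params = 175e9                   # GPT-3 parameter count (N)
tokens = 300e9                   # approximate training tokens (D)
total_flops = 6 * params * tokens            # ~3.15e23 FLOPs

peak_flops_per_gpu = 312e12      # A100 dense FP16/BF16 throughput
utilization = 0.35               # assumed fraction of peak actually sustained
num_gpus = 1024

seconds = total_flops / (peak_flops_per_gpu * utilization * num_gpus)
print(f"~{total_flops:.2e} FLOPs, ~{seconds / 86400:.0f} days on {num_gpus} A100s")
# Prints roughly 33 days, in line with the ~34-day figure in the table above.
```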
Hardware Evolution
- NVIDIA A100 (2020): 312 TFLOPS, 40/80GB memory, workhorse for 2020-2023 training[25]
- NVIDIA H100 (2022): 2-3x faster than A100, becoming standard for frontier models[26]
- Google TPU v5e (2023): Pods with 50,944 chips achieving 10 exaFLOPS[27]
Notable Pre-Trained Models
| Model | Domain | Release Year | Parameters | Key Objective | Developer |
|---|---|---|---|---|---|
| Word2Vec | NLP | 2013 | 300-dim vectors | Skip-gram/CBOW | Google |
| ResNet-50 | CV | 2015 | 25M | Image Classification | Microsoft |
| BERT | NLP | 2018 | 340M | MLM, NSP | Google |
| GPT-3 | NLP | 2020 | 175B | Autoregressive LM | OpenAI |
| RoBERTa | NLP | 2019 | 355M | Dynamic MLM | Meta AI |
| T5 | NLP | 2019 | 11B | Text-to-Text | Google |
| Vision Transformer | CV | 2020 | 86M-632M | Image Classification | Google |
| CLIP | Multimodal | 2021 | 400M | Contrastive Alignment | OpenAI |
| DALL-E | Multimodal | 2021 | 12B | Text-to-Image | OpenAI |
| ELECTRA | NLP | 2020 | 340M | Replaced Token Detection | Google/Stanford |
| XLNet | NLP | 2019 | 340M | Permutation LM | Google/CMU |
| Llama 2 | NLP | 2023 | 7B-70B | Autoregressive LM | Meta |
| Flamingo | Multimodal | 2022 | 80B | Visual Language | DeepMind |
Applications
Natural Language Processing
Pre-trained language models power nearly all modern NLP applications:[14][28]
- Question answering: Models achieve 89.91% F1 on SQuAD 2.0, approaching human performance[29]
- Code generation: GitHub Copilot assists with 43% of code written by developers[30]
- Conversational AI: Powering chatbots and virtual assistants like ChatGPT
- Machine translation: Near-human quality on many language pairs
- Sentiment analysis: 90%+ accuracy for review and social media monitoring
- Text summarization: Condensing documents while preserving key information
Computer Vision
Pre-trained vision models serve as backbones for diverse applications:[31]
- Image classification: Vision Transformers achieve 88.5%+ ImageNet accuracy[32]
- Object detection: 50+ box AP on COCO
- Medical imaging: 90%+ accuracy for pathology detection, cancer screening in CT scans and MRIs
- Autonomous vehicles: Real-time object detection for pedestrians, vehicles, traffic signs
- Industrial automation: Quality control, safety monitoring, defect detection
Speech and Multimodal Systems
- Speech Recognition: Pre-training builds robust speech-to-text systems less sensitive to accents and noise[33]
- Text-to-image generation: Stable Diffusion uses CLIP embeddings for image synthesis[34]
- Visual question answering: Flamingo achieves state-of-the-art on 16 benchmarks[35]
- Zero-shot classification: CLIP matches supervised models without task-specific training
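Zero-shot classification with a CLIP-style model amounts to embedding the image and a text prompt for each candidate class, then picking the class whose prompt embedding is most similar to the image embedding. The sketch below uses random tensors in place of real encoder outputs to show the mechanism only.

```python
# Zero-shot classification sketch: choose the class prompt closest to the image embedding.
import torch
import torch.nn.functional as F

class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Stand-ins for encoder outputs (in practice produced by the pre-trained text and image encoders).
text_emb = F.normalize(torch.randn(len(class_prompts), 512), dim=-1)
image_emb = F.normalize(torch.randn(1, 512), dim=-1)

probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)   # scaled cosine similarities
prediction = class_prompts[probs.argmax().item()]
```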
Benefits and Advantages
Pre-training offers several critical advantages:[36]
- Resource efficiency: Reduces labeled data requirements by 10-100x
- Faster development: Fine-tuning takes hours/days vs. weeks/months from scratch
- Better performance: Pre-trained models consistently outperform random initialization
- Transfer learning: Knowledge transfers across related tasks and domains
- Democratization: Smaller teams can leverage frontier model capabilities
- Generalization: Models learn robust features that work across diverse applications
Challenges and Limitations
Environmental Impact
The computational demands of pre-training create significant environmental costs:
- Carbon emissions: an influential 2019 analysis estimated that training a single large transformer with neural architecture search emitted roughly 626,000 pounds of CO₂[37]
- Water consumption: Estimated 700,000 liters for cooling during GPT-3 training[38]
- Energy use: ChatGPT queries use ~10x more energy than Google searches[39]
Bias, Fairness, and Ethical Considerations
Pre-trained models inherit and amplify societal biases present in training data:[40]
- Gender bias: Models associate professions with specific genders (nurse→women, engineer→men)[41]
- Racial and ethnic bias: Preference for stereotypically Caucasian names in leadership recommendations[41]
- Linguistic bias: C4 filters African American English at 42% vs 6.2% for White American English[42]
- Disability bias: Perpetuation of negative stereotypes about people with disabilities[43]
Data and Legal Issues
- Copyright concerns: Active litigation regarding training on copyrighted content
- Privacy violations: Models may memorize and reproduce personal information
- Data contamination: Benchmarks appearing in training data inflate scores
- Data quality: Web-scraped data contains misinformation, toxicity, and bias
Accessibility and Centralization
- Geographic concentration: US produces 5x more foundation models than China[44]
- Compute barriers: Training frontier models requires $10-100M+ in resources
- Hardware costs: Single H100 GPU costs $30,000-40,000
- Technical expertise: Requires specialized knowledge in distributed systems and optimization
Future Directions
Efficiency Improvements
- Mixture of Experts: Mixtral 8x7B achieves 70B performance with 13B active parameters[45]
- Knowledge distillation: Creating smaller models matching larger model performance
- Quantization: Reducing precision from FP16 to INT8/INT4 with minimal accuracy loss (sketched after this list)
- Flash Attention: 2-4x speedup through optimized attention computation[46]
- ALBERT: Parameter sharing reduces model size by 18x[47]
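As a concrete illustration of the quantization item above, the sketch below applies simple symmetric, per-tensor INT8 quantization to a weight matrix and measures the round-trip error; deployed schemes (per-channel scales, group-wise INT4, outlier handling) are considerably more elaborate.

```python
# Symmetric per-tensor INT8 quantization of a weight matrix (illustrative only).
import torch

w = torch.randn(4096, 4096)                    # a hypothetical full-precision weight matrix
scale = w.abs().max() / 127.0                  # map the largest magnitude onto the int8 range
w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)

w_dequant = w_int8.float() * scale             # approximate reconstruction used at inference time
error = (w - w_dequant).abs().mean()
print(f"4x smaller storage, mean absolute round-trip error {error:.5f}")
```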
Multimodal and Lifelong Learning
- Multimodal pre-training: Models processing text, images, audio, and video seamlessly[48]
- Continual pre-training: Updating models with new data without full retraining[49]
- Domain adaptation: Specializing models for medicine, law, science
- Lifelong learning: Overcoming catastrophic forgetting to learn continuously[50]
Beyond Pre-training: Peak Data and Alignment
Some researchers suggest the era of scaling pre-training datasets is ending:[51]
- Peak data hypothesis: Exhausting high-quality public data sources
- Post-training focus: Greater emphasis on alignment techniques like RLHF
- Constitutional AI: Self-improvement guided by explicit principles[52]
- Direct Preference Optimization: More efficient alternatives to RLHF[53]
- Agentic AI: Systems learning from environment interaction rather than static datasets
See also
- Transfer learning
- Foundation models
- Fine-tuning
- Self-supervised learning
- Transformer (machine learning model)
- BERT
- GPT-3
- Large language model
- Zero-shot learning
- Few-shot learning
- Masked language modeling
- Contrastive learning
- Word embeddings
References
- ↑ Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/1206.5538
- ↑ NIST CSRC AI Glossary (2025). Definition of "pre-training": "training a general-purpose model on publicly-available data, often followed by fine-tuning for task-specific information." https://csrc.nist.gov/glossary/term/pre_training
- ↑ 3.0 3.1 Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. https://arxiv.org/abs/2108.07258
- ↑ Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). "A fast learning algorithm for deep belief nets". Neural Computation, 18(7), 1527-1554. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
- ↑ 5.0 5.1 Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., & Bengio, S. (2010). "Why Does Unsupervised Pre-training Help Deep Learning?". Journal of Machine Learning Research. https://jmlr.org/papers/v11/erhan10a.html
- ↑ Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
- ↑ Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv. https://arxiv.org/abs/1301.3781
- ↑ Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation". EMNLP. https://nlp.stanford.edu/pubs/glove.pdf
- ↑ Peters, M., et al. (2018). "Deep Contextualized Word Representations". NAACL. https://arxiv.org/abs/1802.05365
- ↑ Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS. https://arxiv.org/abs/1706.03762
- ↑ 11.0 11.1 11.2 Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL. https://arxiv.org/abs/1810.04805
- ↑ Brown, T., et al. (2020). "Language Models are Few-Shot Learners". NeurIPS. https://arxiv.org/abs/2005.14165
- ↑ 13.0 13.1 Liu, X., et al. (2020). "Self-supervised Learning: Generative or Contrastive". arXiv. https://arxiv.org/abs/2006.08218
- ↑ 14.0 14.1 14.2 Lee, Angie (December 8, 2022). "What Is a Pretrained AI Model?". NVIDIA Blog. https://blogs.nvidia.com/blog/what-is-a-pretrained-ai-model/
- ↑ 15.0 15.1 Amazon Web Services. "What Is Transfer Learning?". https://aws.amazon.com/what-is/transfer-learning/
- ↑ 16.0 16.1 Dev.to. "Understanding the Differences: Fine-Tuning vs. Transfer Learning". https://dev.to/luxdevhq/understanding-the-differences-fine-tuning-vs-transfer-learning-370
- ↑ Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models". Meta AI. https://arxiv.org/abs/2307.09288
- ↑ Liu, Y., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv. https://arxiv.org/abs/1907.11692
- ↑ Joshi, M., et al. (2020). "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL. https://arxiv.org/abs/1907.10529
- ↑ Clark, K., et al. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR. https://arxiv.org/abs/2003.10555
- ↑ Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- ↑ Chen, T., et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations". ICML. https://arxiv.org/abs/2002.05709
- ↑ Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". ICML. https://arxiv.org/abs/2103.00020
- ↑ He, K., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners". CVPR. https://arxiv.org/abs/2111.06377
- ↑ NVIDIA. (2020). "NVIDIA A100 Tensor Core GPU Architecture". https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
- ↑ NVIDIA. (2022). "NVIDIA H100 Tensor Core GPU Architecture". https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
- ↑ Google Cloud. (2023). "Cloud TPU v5e". https://cloud.google.com/tpu/docs/v5e
- ↑ Toloka. "What is Pre-training in LLM Development?". https://toloka.ai/blog/pre-training-in-llm-development/
- ↑ Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD". ACL. https://arxiv.org/abs/1806.03822
- ↑ GitHub. (2024). "GitHub Copilot Impact Research". https://github.blog/news-insights/research/
- ↑ Viso.ai. "Top 45 Computer Vision Applications in 2024". https://viso.ai/applications/computer-vision-applications/
- ↑ Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR. https://arxiv.org/abs/2010.11929
- ↑ GeeksForGeeks. "What is Pre-training and its Objective?". https://www.geeksforgeeks.org/artificial-intelligence/what-is-pre-training-and-its-objective/
- ↑ Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR. https://arxiv.org/abs/2112.10752
- ↑ Alayrac, J. B., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". NeurIPS. https://arxiv.org/abs/2204.14198
- ↑ Baeldung (2025). "What Does Pre-training a Neural Network Mean?". https://www.baeldung.com/cs/neural-network-pre-training
- ↑ Strubell, E., Ganesh, A., & McCallum, A. (2019). "Energy and Policy Considerations for Deep Learning in NLP". ACL. https://arxiv.org/abs/1906.02243
- ↑ Li, P., et al. (2023). "Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models". arXiv. https://arxiv.org/abs/2304.03271
- ↑ Wikipedia. "Environmental impact of artificial intelligence". https://en.wikipedia.org/wiki/Environmental_impact_of_artificial_intelligence
- ↑ Google Cloud. "What are foundation models?". https://cloud.google.com/discover/what-are-foundation-models
- ↑ 41.0 41.1 Li, Z., et al. (2024). "Explicitly unbiased large language models still form biased mental models". PNAS. https://www.pnas.org/doi/10.1073/pnas.2416228122
- ↑ Dodge, J., et al. (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus". EMNLP. https://arxiv.org/abs/2104.08758
- ↑ Venkit, P. N., et al. (2022). "A Study of Implicit Bias in Pretrained Language Models against People with Disabilities". COLING. https://aclanthology.org/2022.coling-1.113/
- ↑ Maslej, N., et al. (2024). "The AI Index 2024 Annual Report". Stanford HAI. https://aiindex.stanford.edu/report/
- ↑ Jiang, A. Q., et al. (2024). "Mixtral of Experts". arXiv. https://arxiv.org/abs/2401.04088
- ↑ Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS. https://arxiv.org/abs/2205.14135
- ↑ Lan, Z., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR. https://arxiv.org/abs/1909.11942
- ↑ MultimodalPretraining.github.io. "Workshop on Multimodal Pre-training". https://multimodalpretraining.github.io/
- ↑ Ibrahim, M., et al. (2024). "Simple and Scalable Strategies to Continually Pre-train Large Language Models". arXiv. https://arxiv.org/abs/2403.08763
- ↑ Mehta, S. V. (2023). "Efficient Lifelong Learning in Deep Neural Networks". Carnegie Mellon University. https://kilthub.cmu.edu/articles/thesis/24992883
- ↑ PDFTranslate.ai. "Ilya Sutskever: LLM Pre-training as we know it will end". https://pdftranslate.ai/blog/llm-end-of-era
- ↑ Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". Anthropic. https://arxiv.org/abs/2212.08073
- ↑ Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS. https://arxiv.org/abs/2305.18290
External links
- Hugging Face Model Hub - Repository of pre-trained models
- BERT GitHub Repository - Original BERT implementation
- GPT-3 Applications - Examples of GPT-3 use cases
- TensorFlow Hub - Pre-trained model repository
- PyTorch Hub - Pre-trained models for PyTorch
- Common Crawl - Large-scale web crawl data
- ImageNet - Visual database for object recognition