# Pre-Trained Model

> Source: https://aiwiki.ai/wiki/pre-trained_model
> Updated: 2026-06-25
> Categories: Deep Learning, Machine Learning, Natural Language Processing, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **pre-trained model** is a [machine learning](/wiki/machine_learning) model that has already been trained on a large, general-purpose dataset and can then be reused, either as a fixed feature extractor or by [fine-tuning](/wiki/fine_tuning), for a new downstream task. Instead of building a model from scratch and initializing its [weights](/wiki/weight) randomly, practitioners start from a model that has already learned useful representations from large-scale data, then adapt it with a smaller, task-specific dataset. Pre-trained models are the foundation of [transfer learning](/wiki/transfer_learning) and of modern [foundation models](/wiki/large_language_model), and they are the dominant paradigm in [deep learning](/wiki/deep_neural_network) across [computer vision](/wiki/computer_vision), [natural language processing](/wiki/natural_language_understanding), speech, and multimodal AI.

The canonical workflow has two stages: a compute-intensive **pre-training** stage on broad data (for example, [ImageNet](/wiki/image_recognition) for vision or large web text corpora for language), followed by a cheaper **adaptation** stage for the target task. This pretrain-then-finetune recipe lets teams download a model that already encodes general knowledge and specialize it in hours rather than weeks, cutting cost, training time, and labeled-data requirements while often improving accuracy over training from scratch [5][9]. The largest public catalog of these models, the [Hugging Face](/wiki/hugging_face) Hub, hosted more than 2 million models by 2025, up from one million about eleven months earlier [15].

## Explain like I'm 5 (ELI5)

Imagine you want to teach someone to cook Italian food. You could start by teaching them what a stove is, how to hold a knife, and what salt tastes like. Or you could find someone who already knows how to cook French food and just teach them the differences for Italian recipes. A pre-trained model is like that experienced cook: it already knows lots of general things (how ingredients combine, how heat works), so it can learn a new style of cooking much faster than starting from zero.

## What is a pre-trained model?

A pre-trained model is a network whose parameters were learned on a source task with abundant data, then carried over to a target task. The intuition, established experimentally by Yosinski et al. (2014) at NeurIPS, is that the lower layers of a deep network learn **general** features (edges, color blobs, basic textures or, in language, syntax and word co-occurrence) that are useful far beyond the original task, while the upper layers learn features that are increasingly **specific** to the source task [4]. Because the general features transfer, a model trained once on a large dataset can seed many downstream models. Yosinski et al. found that "initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset" [4].

This approach is central to [transfer learning](/wiki/transfer_learning), where knowledge gained from one task is applied to a different but related task, and it underlies foundation models. As the Stanford report that coined the term put it, foundation models are "models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks" [12].

## History and development

### Early foundations (1976 to 2012)

The concept of transferring learned knowledge between tasks dates back to Stevo Bozinovski's 1976 work on neural network training transfer [1]. However, the modern notion of transfer learning took shape with the work of Thrun and Pratt in 1998 and was later formalized in a survey by Pan and Yang in 2009 [2]. These works established that a model trained on one domain can improve [generalization](/wiki/generalization) on a related domain, particularly when labeled data in the target domain is scarce.

In computer vision, the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) marked a turning point. [AlexNet](/wiki/deep_neural_network), designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points [3]. This result demonstrated that deep [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) trained on large image datasets could learn highly transferable visual features. AlexNet's success triggered widespread adoption of ImageNet-pretrained CNNs as starting points for downstream vision tasks.

### The ImageNet pretraining era (2013 to 2017)

Following AlexNet, a series of deeper and more sophisticated CNN architectures were developed and pre-trained on [ImageNet](/wiki/image_recognition):

| Model | Year | Key innovation | Parameters | ILSVRC top-5 error |
|---|---|---|---|---|
| [AlexNet](/wiki/deep_neural_network) | 2012 | Deep CNN with ReLU, dropout, GPU training | 60 million | 15.3% |
| [VGGNet](/wiki/vgg) | 2014 | Uniform 3x3 convolutions, increased depth | 138 million (VGG-16) | 7.3% |
| GoogLeNet (Inception) | 2014 | Inception modules with parallel filter sizes | 6.8 million | 6.7% |
| [ResNet](/wiki/resnet) | 2015 | Residual (skip) connections enabling 152+ layers | 60 million (ResNet-152) | 3.6% |
| [DenseNet](/wiki/densenet) | 2017 | Dense connections between all layers | 20 million (DenseNet-201) | ~6.3% (ImageNet validation, single crop; not an ILSVRC entry) |

Yosinski et al. (2014) experimentally quantified the transferability of features at different layers of a deep neural network [4]. The study found that early layers learn general features (such as edge detectors and color blobs) applicable to many tasks, while later layers become increasingly task-specific, and that initializing a network with transferred features from almost any number of layers improved generalization even after fine-tuning on a new dataset. These results gave empirical support for using ImageNet-pretrained models as feature extractors or initialization points across computer vision tasks, including [object detection](/wiki/object_detection), [image segmentation](/wiki/image_segmentation), and [pose estimation](/wiki/pose_estimation).

### Word embeddings and early NLP pretraining (2013 to 2018)

In natural language processing, pre-training began with [word embeddings](/wiki/word_embedding). Word2Vec, introduced by Tomas Mikolov et al. at Google in 2013, trained shallow neural networks on large text corpora to produce dense vector representations of words. GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, used word co-occurrence statistics to learn similar embeddings. Both Word2Vec and GloVe produced static, context-independent embeddings, meaning a word like "bank" would have the same vector regardless of whether it referred to a financial institution or a river bank.

ELMo (Embeddings from Language Models), released by the Allen Institute for AI in February 2018, addressed this limitation. ELMo used a deep bidirectional [LSTM](/wiki/long_short-term_memory_lstm) pre-trained on a corpus of one billion words to produce context-dependent word representations. Unlike Word2Vec and GloVe, ELMo generated different embeddings for the same word depending on its surrounding context. ELMo's pre-trained representations improved performance on a wide range of NLP benchmarks and marked the transition from pre-trained word vectors to pre-trained full models in NLP.

### The transformer revolution (2018 to present)

The release of [GPT](/wiki/gpt_generative_pre-trained_transformer) (Generative Pre-trained Transformer) by OpenAI in June 2018 and [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) (Bidirectional Encoder Representations from Transformers) by Google in October 2018 established the modern paradigm of large-scale pre-training on unlabeled text using [transformer](/wiki/transformer) architectures [5][6].

BERT used two pre-training objectives: masked language modeling (MLM), where 15% of input [tokens](/wiki/token) are randomly masked and the model predicts them from bidirectional context, and next sentence prediction (NSP), where the model determines whether two sentences are consecutive [5]. BERT was pre-trained on English Wikipedia (about 2.5 billion words) and the BookCorpus dataset (about 800 million words). The BERT-Base model contained 110 million [parameters](/wiki/parameter), while BERT-Large contained 340 million parameters [5].
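The masking step of MLM can be illustrated with a minimal sketch. It assumes whitespace-split words stand in for real subword tokens, and it samples a fixed number of positions instead of drawing each position independently. The 80/10/10 replacement rule comes from the BERT paper [5]; the function name `mask_tokens`, the `[MASK]` string, and the toy sentence are illustrative only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed where labels[i] is not None.
    """
    rng = random.Random(seed)
    # Select roughly mask_prob of the positions (at least one if possible).
    n_select = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    selected = set(rng.sample(range(len(tokens)), n_select))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:        # 80%: replace with the mask symbol
            corrupted.append(MASK)
        elif roll < 0.9:      # 10%: replace with a random vocabulary token
            corrupted.append(rng.choice(vocab))
        else:                 # 10%: keep the original token
            corrupted.append(tok)
    return corrupted, labels

tokens = "the bank raised interest rates after the river flooded the town again today".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab)
```

A real pipeline would operate on token IDs from the model's tokenizer and would skip special tokens; this sketch only shows which positions contribute to the prediction loss.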

The GPT series took a different approach, using autoregressive (left-to-right) [language modeling](/wiki/language_model) as its pre-training objective [6]. The progression from GPT-1 to GPT-3 demonstrated the power of scaling:

| Model | Year | Parameters | Pre-training data | Pre-training objective |
|---|---|---|---|---|
| GPT-1 | 2018 | 117 million | BookCorpus | Autoregressive LM |
| [GPT-2](/wiki/gpt2) | 2019 | 1.5 billion | WebText (40 GB) | Autoregressive LM |
| [GPT-3](/wiki/gpt3) | 2020 | 175 billion | Common Crawl + others (570 GB) | Autoregressive LM |
| [GPT-4](/wiki/gpt4) | 2023 | Undisclosed (rumored ~1.7 trillion) | Undisclosed | Autoregressive LM + RLHF |

This era also produced a wide range of other pre-trained transformer models, including [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert), [XLNet](/wiki/xlnet), [ELECTRA](/wiki/electra), [DeBERTa](/wiki/deberta), and [T5](/wiki/t5).

## What are the main pre-training methods?

### Supervised pre-training

In supervised pre-training, a model is trained on a large labeled dataset before being adapted to a target task. The classic example is training a CNN on the ImageNet classification task (1,000 object categories across about 1.2 million training images) and then transferring the learned features to a new task [3]. Supervised pre-training works well when a large, high-quality labeled dataset is available for the source task. However, creating such labeled datasets is expensive, and supervised pre-training can introduce biases present in the source labels.

### Unsupervised pre-training

Unsupervised pre-training trains models to learn representations from unlabeled data by reconstructing or predicting parts of the input. Traditional approaches include [autoencoders](/wiki/variational_autoencoder), which compress input data into a lower-dimensional representation and then reconstruct it, and clustering algorithms like [k-means](/wiki/k-means). Unsupervised pre-training was historically used as an initialization strategy for deep networks before techniques like [batch normalization](/wiki/batch_normalization) and better [activation functions](/wiki/activation_function) made training deep networks from scratch more feasible.

### Self-supervised pre-training

Self-supervised pre-training generates supervisory signals from the data itself, without requiring human-provided labels. This approach has become the dominant pre-training paradigm in both NLP and computer vision [13].

In NLP, self-supervised objectives include:

- **Masked language modeling (MLM):** Used by BERT and its variants. The model masks a portion of input tokens and learns to predict them from context [5].
- **Autoregressive language modeling:** Used by the GPT series. The model predicts the next token given all preceding tokens [6].
- **Denoising objectives:** Used by T5 and BART. The model reconstructs text from corrupted input.
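As a rough illustration of the autoregressive objective in the list above, the sketch below turns one token sequence into (context, target) pairs and averages the negative log-likelihood of each target. The `uniform_model` function is a hypothetical stand-in for a neural network, and the four-token vocabulary is an assumption made for the example.

```python
import math

def next_token_pairs(token_ids):
    """Split one sequence into (context, target) pairs for left-to-right LM."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def average_nll(token_ids, prob_fn):
    """Mean negative log-likelihood of each token given its preceding tokens."""
    pairs = next_token_pairs(token_ids)
    return sum(-math.log(prob_fn(ctx, tgt)) for ctx, tgt in pairs) / len(pairs)

def uniform_model(context, target, vocab_size=4):
    # Hypothetical stand-in for a network: every token is equally likely.
    return 1.0 / vocab_size

sequence = [2, 0, 3, 1]
pairs = next_token_pairs(sequence)
loss = average_nll(sequence, uniform_model)
```

With a uniform model the loss equals the log of the vocabulary size; pre-training drives this value down by assigning higher probability to the tokens that actually follow each context.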

In computer vision, self-supervised methods include:

- **Contrastive learning:** Methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) train models to recognize that different augmented views of the same image should have similar representations, while views from different images should be dissimilar.
- **Masked image modeling:** Masked Autoencoders (MAE), introduced by He et al. in 2022, apply the masked prediction concept from NLP to images [8]. The model masks random patches of an input image and learns to reconstruct the missing pixels. MAE was directly inspired by BERT's masked token prediction.
- **Self-distillation:** [DINO](/wiki/dino_model) (Caron et al., 2021) trains a student network to match the output of a teacher network (an exponential moving average of the student), producing strong visual representations without labels.
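The "exponential moving average of the student" used for the teacher in self-distillation can be sketched in a few lines. This shows only the weight update, not the full DINO training procedure; the flat weight lists and the momentum values are illustrative assumptions.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]
# A low momentum is used here only so the drift is visible after three steps.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

Because the teacher is never updated by gradients, it changes more slowly than the student and provides a more stable target.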

### Contrastive language-image pre-training

CLIP (Contrastive Language-Image Pre-training), announced by OpenAI in January 2021 with the accompanying paper posted in February, represents a multimodal pre-training approach [9]. CLIP jointly trains an image encoder (either a [ResNet](/wiki/resnet) or a [Vision Transformer](/wiki/deit)) and a text encoder (a 12-layer transformer) on 400 million image-text pairs collected from the internet. The model learns to match images with their corresponding text descriptions using a contrastive objective. CLIP's pre-trained representations enable zero-shot classification on new image categories without any task-specific fine-tuning, simply by comparing image embeddings with text embeddings of category descriptions [9].
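A minimal sketch of a CLIP-style objective and of zero-shot classification follows, assuming the image and text encoders have already produced embedding vectors. The two-dimensional embeddings are made up for illustration, and the temperature is held fixed at 0.07 here, whereas in practice it is usually a learned parameter.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

def cross_entropy_rows(logits):
    """Mean cross-entropy where row k's correct column is k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class description most similar to the image."""
    sims = similarity_matrix([image_emb], class_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

image_embs = [[1.0, 0.1], [0.1, 1.0]]   # toy image embeddings
text_embs = [[0.9, 0.0], [0.0, 0.8]]    # toy embeddings of the matching captions
matched = clip_style_loss(image_embs, text_embs)
mismatched = clip_style_loss(image_embs, text_embs[::-1])
```

The loss is small when each image is closest to its own caption and large when the pairing is shuffled, which is the signal that pulls matching pairs together during pre-training.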

## How do you use a pre-trained model?

Once a model has been pre-trained, it must be adapted to a specific downstream task. Several strategies exist, ranging from simple to sophisticated. The two foundational options are **feature extraction** (freeze the pre-trained model and train only a new head) and **fine-tuning** (continue training some or all of the pre-trained weights on the new task).

### Feature extraction

The simplest adaptation approach treats the pre-trained model as a fixed [feature extractor](/wiki/feature_extraction). The pre-trained model's weights are frozen, and only a new output layer (or a small classifier head) is trained on the target dataset. This approach works well when the target dataset is small and the pre-training domain is similar to the target domain. In computer vision, this often means extracting features from one of the later [layers](/wiki/layer) of a pre-trained CNN and training a linear classifier or small neural network on top.
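The idea can be sketched without any deep learning library. In the toy example below, `BACKBONE_WEIGHTS` is a hypothetical stand-in for a pre-trained network, and only a logistic-regression head is trained on the features it produces; the data and hyperparameters are invented for illustration.

```python
import math

# Hypothetical "pre-trained" backbone: one fixed linear layer followed by ReLU.
BACKBONE_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]

def backbone(x):
    """Frozen feature extractor; its weights are never updated below."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on precomputed features with plain SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    f = backbone(x)
    return int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5)

inputs = [[2.0, 0.5], [1.5, 0.0], [0.2, 1.9], [0.0, 1.0]]
labels = [1, 1, 0, 0]
features = [backbone(x) for x in inputs]   # computed once, since the backbone is frozen
w, b = train_head(features, labels)
predictions = [predict(x, w, b) for x in inputs]
```

Because the backbone never changes, its features can be computed once and cached, which is a large part of why feature extraction is cheap.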

### Full fine-tuning

Full fine-tuning updates all parameters of the pre-trained model on the target task's data. The model is typically initialized with the pre-trained weights and trained with a lower [learning rate](/wiki/learning_rate) than would be used for training from scratch. This is the most expressive adaptation method, allowing the model to adjust all of its representations to the target task. However, full fine-tuning is computationally expensive for very large models and can lead to [overfitting](/wiki/overfitting) when the target dataset is small.

### Gradual unfreezing

A middle ground between feature extraction and full fine-tuning, gradual unfreezing starts by training only the new output layer while keeping all pre-trained layers frozen. Layers are then unfrozen one group at a time, starting with the layers closest to the output and moving toward the input, so that the lower, more general layers are adjusted only after the upper layers have adapted to the new task. This technique, popularized by Howard and Ruder (2018) in ULMFiT, helps prevent catastrophic forgetting of the pre-trained representations.
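One way to picture the schedule is as a list of trainable layers per epoch. The sketch below assumes one additional layer is unfrozen each epoch; the layer names are hypothetical.

```python
def gradual_unfreezing_schedule(layer_names, epochs):
    """For each epoch, list the layers that are trainable.

    layer_names is ordered from input (first) to output (last). Epoch 0 trains
    only the last layer; each later epoch unfreezes one more layer, moving
    toward the input, until everything is trainable.
    """
    schedule = []
    for epoch in range(epochs):
        n_trainable = min(len(layer_names), epoch + 1)
        schedule.append(layer_names[len(layer_names) - n_trainable:])
    return schedule

layers = ["embedding", "block_1", "block_2", "head"]
schedule = gradual_unfreezing_schedule(layers, epochs=5)
```

In a real framework, each entry of the schedule would be used to toggle which parameters receive gradient updates during that epoch.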

### Parameter-efficient fine-tuning (PEFT)

As pre-trained models have grown to billions of parameters, full fine-tuning has become impractical for many users. Parameter-efficient fine-tuning methods update only a small subset of model parameters while keeping the rest frozen.

| Method | Year | Approach | Trainable parameters / efficiency |
|---|---|---|---|
| Adapter tuning | 2019 | Insert small bottleneck modules between transformer layers | ~3-4% of original parameters |
| Prefix tuning | 2021 | Prepend trainable vectors to each transformer layer's input | ~0.1% of original parameters |
| [LoRA](/wiki/fine_tuning) (Low-Rank Adaptation) | 2021 | Decompose weight updates into low-rank matrices | Up to 10,000x fewer trainable parameters |
| QLoRA | 2023 | Combine LoRA with 4-bit [quantization](/wiki/quantization) | Further memory reduction over LoRA |
| DoRA | 2024 | Decompose weights into magnitude and direction for LoRA | Similar to LoRA with improved accuracy |

LoRA, proposed by Hu et al. in 2021, has become the most widely used PEFT method [10]. Instead of updating the full weight matrix of a pre-trained model, LoRA freezes the original weights and injects trainable low-rank decomposition matrices into each layer. According to the paper, "compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times" [10]. The paper also reports that LoRA performs on par with or better than full fine-tuning on several models [10]; one common explanation is that keeping the pre-trained weights frozen limits catastrophic forgetting of pre-trained knowledge.
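The low-rank update can be sketched for a single weight matrix as follows. The matrices are tiny toy values, the fixed values in `A` stand in for a random initialization, and the 4096 x 4096 layer size in the parameter count is a hypothetical example, not a figure from the paper.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """(full fine-tuning count, LoRA count) for a single weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # frozen pre-trained weights (2 x 3)
A = [[0.1, 0.2, 0.3]]                     # rank r = 1; fixed values standing in for random init
B = [[0.0], [0.0]]                        # zero init, so the adapted layer starts equal to W
x = [1.0, 1.0, 1.0]

y_start = lora_forward(W, A, B, x)
full, lora = trainable_params(d_out=4096, d_in=4096, r=8)
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and for this hypothetical layer size the rank-8 adapter trains 256 times fewer parameters than updating the matrix directly.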

### Prompt tuning and in-context learning

For very large language models, adaptation can be achieved without modifying the pre-trained weights at all. Prompt tuning prepends a small set of learnable "soft" tokens to the input and trains only those embeddings, while in-context learning (popularized by GPT-3) provides task demonstrations directly in the input prompt and involves no training [7]. Both approaches let a single frozen pre-trained model serve many different tasks.
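In-context learning amounts to careful prompt construction. The helper below is a hypothetical sketch of a few-shot sentiment prompt; the `Review:` / `Sentiment:` template and the example reviews are assumptions, and the resulting string would be sent to whichever language model is in use.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble an in-context-learning prompt from labeled demonstrations."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The final example is left unlabeled for the model to complete.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("The plot was gripping.", "positive"), ("I left halfway through.", "negative")],
    "A delightful surprise.",
)
```

No weights change at any point: the demonstrations influence the output only through the model's forward pass over the prompt.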

## What is the difference between feature extraction and fine-tuning?

Feature extraction and fine-tuning are the two ends of the adaptation spectrum, and the right choice depends mainly on the size of the target dataset and how close it is to the pre-training domain.

| Dimension | Feature extraction | Fine-tuning |
|---|---|---|
| Pre-trained weights | Frozen | Updated (some or all layers) |
| What is trained | Only a new head or classifier | The head plus part or all of the backbone |
| Compute and memory | Low | Higher (full fine-tuning is the most costly) |
| Risk of overfitting | Lower on small data | Higher on small data |
| Best when | Target data is small and similar to pre-training data | Target data is larger or the domain differs |
| Typical result | Fast baseline | Higher ceiling accuracy |

In practice, parameter-efficient methods such as LoRA blur this line: they keep the original weights frozen like feature extraction, yet adapt the network's behavior like fine-tuning, capturing much of fine-tuning's accuracy at a fraction of the cost [10].

## What is the difference between pre-training and fine-tuning?

Pre-training and fine-tuning are the two stages of the same workflow, distinguished by their data, objective, scale, and who performs them.

| Aspect | Pre-training | Fine-tuning |
|---|---|---|
| Data | Large, broad, often unlabeled (web text, ImageNet) | Small, narrow, usually labeled for one task |
| Objective | Self-supervised or supervised general learning (e.g., next-token or masked-token prediction) | Task-specific loss (e.g., cross-entropy on labeled task examples) |
| Compute | Very high (days to months on many accelerators) | Comparatively low (often hours) |
| Frequency | Done once by a model provider | Done many times by downstream users |
| Output | A general-purpose pre-trained (foundation) model | A specialized model for one application |

The value of the pre-trained model is that the expensive first stage is amortized across all of the cheap second stages [13].

## What is a foundation model?

In August 2021, Stanford University's Center for Research on Foundation Models (CRFM) published a report titled "On the Opportunities and Risks of Foundation Models," which introduced the term **foundation model** to describe pre-trained models that are trained on broad data at scale and can be adapted to a wide range of downstream tasks [12]. Examples include BERT, GPT-3, [DALL-E](/wiki/dall_e), and CLIP.

The report argued that, although foundation models are "based on standard deep learning and transfer learning," their scale "results in new emergent capabilities" not present in smaller models [12]. Their effectiveness across many tasks also incentivizes homogenization, where a few large foundation models serve as the base for a wide variety of applications. The report cautions that this concentration is double-edged: "the defects of the foundation model are inherited by all the adapted models downstream" [12]. This shift has significant implications for AI research, deployment, and governance.

Foundation models now span multiple modalities:

| Domain | Example models | Pre-training data |
|---|---|---|
| Text | [GPT-4](/wiki/gpt4), [Claude](/wiki/claude), [Llama](/wiki/llama), [Mistral](/wiki/mistral) | Trillions of text tokens from the web, books, and code |
| Images | [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall_e), [Imagen](/wiki/imagen) | Billions of image-text pairs |
| Code | [Codex](/wiki/openai_codex), [StarCoder](/wiki/starcoder), [Code Llama](/wiki/code_llama) | Billions of lines of source code |
| Multimodal | [Gemini](/wiki/gemini), GPT-4V, CLIP | Paired text and images, plus audio and video for some models |
| Audio/Speech | [Whisper](/wiki/whisper), [Wav2Vec](/wiki/wav2vec) | Tens to hundreds of thousands of hours of audio |

## How much data and compute does pre-training need?

Research into scaling laws has provided guidance on how to allocate compute budgets for pre-training. Kaplan et al. (2020) at OpenAI published an influential set of [scaling laws](/wiki/scaling_laws_paper) for neural language models, observing that test loss falls predictably as a power law in model size, dataset size, and compute.

The Chinchilla scaling laws, published by Hoffmann et al. at DeepMind in 2022, refined these findings [11]. By training over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens, the researchers found that the model size and the number of training tokens should be scaled equally for compute-optimal training. The optimal ratio was approximately 20 tokens per parameter [11].

This finding had significant practical impact. The [Chinchilla](/wiki/chinchilla_scaling) model (70 billion parameters trained on 1.4 trillion tokens) outperformed much larger models like Gopher (280 billion parameters) and GPT-3 (175 billion parameters) that had been trained on fewer tokens relative to their size [11]. The Chinchilla-optimal ratio of approximately 20 tokens per parameter became an industry benchmark, though subsequent models such as Llama 3 (whose 70-billion-parameter version was trained on roughly 15 trillion tokens, more than 200 tokens per parameter) have shown that "over-training" beyond this ratio can be worthwhile when inference cost matters more than training cost.
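The arithmetic behind these comparisons is simple enough to sketch. The 20-tokens-per-parameter ratio is the approximate figure quoted above; the "6 FLOPs per parameter per token" estimate is a common rule of thumb, not an exact cost model; and the 300 billion training tokens for GPT-3 and 15 trillion for Llama 3 are figures assumed from those models' own reports, not from this article's sources.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla ratio

def compute_optimal_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Training tokens suggested by a fixed tokens-per-parameter ratio."""
    return n_params * ratio

def training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

chinchilla_tokens = compute_optimal_tokens(70e9)
chinchilla_flops = training_flops(70e9, 1.4e12)
gpt3_ratio = tokens_per_param(175e9, 300e9)          # assumed 300B training tokens
llama3_70b_ratio = tokens_per_param(70e9, 15e12)     # assumed 15T training tokens
```

Under these assumptions GPT-3 saw fewer than 2 tokens per parameter, Chinchilla about 20, and Llama 3 70B more than 200, which is the sense in which the first was under-trained and the last over-trained relative to the compute-optimal ratio.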

## Challenges and limitations

### Catastrophic forgetting

When a pre-trained model is fine-tuned on a new task, it can lose knowledge acquired during pre-training. This phenomenon, known as catastrophic forgetting, is a fundamental challenge in [continual learning](/wiki/continual_learning). Fine-tuning large language models on specific domains frequently degrades their performance on previously learned tasks. Techniques to mitigate this include elastic weight consolidation (EWC), progressive memory banks, rehearsal-based methods that replay examples from earlier tasks, and parameter-efficient fine-tuning methods like LoRA [10].
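Elastic weight consolidation adds a quadratic penalty that discourages important weights from drifting away from their pre-trained values. The sketch below shows only that penalty term; the per-weight importance values in `fisher` are hypothetical, and estimating them from data is a separate step not shown here.

```python
def ewc_penalty(params, anchor_params, fisher, strength=1.0):
    """Quadratic penalty: (strength / 2) * sum_i F_i * (theta_i - anchor_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

def total_loss(task_loss, params, anchor_params, fisher, strength=1.0):
    """New-task loss plus the penalty for moving away from the anchor weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)

pretrained = [0.5, -1.0, 2.0]   # weights after pre-training (the anchor)
fisher = [10.0, 0.1, 0.0]       # hypothetical per-weight importance estimates

# Moving an important weight by 1.0 is penalized heavily...
drift_important = ewc_penalty([1.5, -1.0, 2.0], pretrained, fisher)
# ...while moving a weight with zero importance costs nothing.
drift_unimportant = ewc_penalty([0.5, -1.0, 3.0], pretrained, fisher)
```

The effect is that fine-tuning remains free to change weights that mattered little for the original task while being pulled back on those that mattered most.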

### Domain shift

Pre-trained models may not transfer well when there is a large gap between the pre-training domain and the target domain. For example, a model pre-trained on natural images may perform poorly when applied to medical imaging or satellite imagery without significant adaptation. The degree of domain shift determines how much fine-tuning data and compute are needed for effective transfer.

### Bias and fairness

Pre-trained models inherit biases present in their training data. Since large-scale pre-training datasets are typically collected from the internet, they reflect societal biases related to gender, race, and other protected attributes. These biases can propagate to all downstream applications that build on the pre-trained model [12]. Addressing bias in pre-trained models requires careful dataset curation, debiasing techniques during or after training, and ongoing evaluation on fairness benchmarks.

### Computational and environmental cost

Pre-training large models requires substantial computational resources. Training GPT-3 (175 billion parameters) was estimated to cost several million dollars and consumed significant energy. As models continue to scale, the environmental impact of pre-training has become a growing concern. This has motivated research into more efficient pre-training methods, [model compression](/wiki/quantization), and the sharing of pre-trained models through public repositories.

### Security and misuse

Pre-trained models can be exploited for malicious purposes, including generating misinformation, deepfakes, or harmful code. The open release of powerful pre-trained models requires balancing the benefits of open research with the risks of misuse. Some organizations have adopted staged release strategies (as OpenAI did with GPT-2) or restricted access through APIs.

## How are large pre-trained models compressed for deployment?

Deploying large pre-trained models in production environments often requires reducing their size and computational requirements. The main compression techniques include:

- **Pruning:** Systematically removing less important weights or neurons from the model. Structured pruning removes entire filters or attention heads, while unstructured pruning removes individual weights. NVIDIA has developed methods combining structured weight pruning with [knowledge distillation](/wiki/knowledge_editing) to compress LLMs without significant quality loss.
- **Quantization:** Reducing the numerical precision of model weights from 32-bit floating point (FP32) to lower-precision formats like 8-bit integer (INT8) or 4-bit. This reduces memory usage and speeds up inference, particularly on hardware with native low-precision support. Post-training quantization can be applied to an already-trained model without retraining, although accuracy can degrade at very low bit widths.
- **Knowledge distillation:** Training a smaller "student" model to mimic the outputs of a larger "teacher" model. The student is typically trained to match the teacher's output distribution, and in some methods its intermediate representations as well. DistilBERT, for example, is a distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster [14].
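Two of the techniques in the list above reduce to short formulas. The sketch below shows symmetric per-tensor INT8 quantization and a temperature-softened distillation loss. Both are simplified: production quantizers typically use per-channel scales and calibration data, and practical distillation losses are usually combined with a standard task loss. The weights and logits are toy values.

```python
import math

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The quantization error is at most half of `scale` per weight, and the distillation loss is zero only when the student's softened distribution matches the teacher's.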

## Where can you find pre-trained models?

The widespread adoption of pre-trained models has led to the creation of dedicated repositories and hubs where researchers and practitioners can discover, download, and share models.

| Repository | Maintained by | Notable features |
|---|---|---|
| [Hugging Face](/wiki/hugging_face) Hub | Hugging Face | More than 2 million models (2025); supports [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), JAX; model cards and community discussions [15] |
| [TensorFlow](/wiki/tensorflow) Hub | Google | Pre-trained models for TF ecosystem; SavedModel format; easy integration with Keras |
| PyTorch Hub | Meta | Research-focused; reproducing published results; tight integration with PyTorch |
| [Kaggle](/wiki/data_set_or_dataset) Models | Google | Models with competition datasets; notebook integration |
| NVIDIA NGC | NVIDIA | Optimized models for NVIDIA GPUs; enterprise-grade containers |

Hugging Face has emerged as the largest and most widely used model hub. By 2025 the Hub hosted more than 2 million models, having taken over 1,000 days to reach its first million models and only about 335 days to reach the second million [15]. Its Transformers library provides a unified API for loading and using pre-trained models across different frameworks, and the platform hosts models from individual researchers, academic labs, and companies including Meta, Mistral, DeepSeek, and Stability AI.

## Applications

Pre-trained models are used across virtually every area of applied AI:

- **Natural language processing:** Pre-trained [large language models](/wiki/large_language_model) power chatbots, [machine translation](/wiki/machine_translation), [text summarization](/wiki/text_summarization), [sentiment analysis](/wiki/sentiment_analysis), question answering, and code generation. Models like GPT-4 and Claude can perform these tasks through in-context learning without any fine-tuning [7].
- **Computer vision:** ImageNet-pretrained CNNs and Vision Transformers serve as backbones for object detection, image segmentation, medical image analysis, autonomous driving perception, and satellite imagery interpretation.
- **Speech and audio:** Pre-trained models like Whisper (OpenAI) and Wav2Vec 2.0 (Meta) enable [speech recognition](/wiki/speech_recognition) across dozens of languages, speaker identification, and audio classification.
- **Healthcare:** Pre-trained models are adapted for medical image interpretation (X-rays, CT scans, histopathology), drug discovery, protein structure prediction (AlphaFold), and clinical note analysis.
- **Autonomous driving:** Self-driving systems use pre-trained vision models for perception tasks and, more recently, pre-trained vision-language-action (VLA) models for end-to-end driving decisions.
- **Robotics:** Robots use pre-trained language and vision models to understand instructions, perceive their environments, and plan actions. Tesla's Optimus humanoid robot, for example, has been reported to train in simulation using pre-trained models before transferring learned behaviors to physical hardware.
- **Scientific research:** Pre-trained models are applied to protein folding, climate modeling, materials discovery, and genomics, often by fine-tuning general-purpose models on domain-specific scientific data.

## See also

- [Transfer learning](/wiki/transfer_learning)
- [Fine-tuning](/wiki/fine_tuning)
- [Foundation model](/wiki/large_language_model)
- [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)
- [GPT](/wiki/gpt_generative_pre-trained_transformer)
- [Self-supervised learning](/wiki/self-supervised_learning)
- [Feature extraction](/wiki/feature_extraction)
- [Transformer](/wiki/transformer)
- [Convolutional neural network](/wiki/convolutional_neural_network)

## References

1. Bozinovski, S. (1976). "Influence of pattern similarity and transfer learning upon the training of a base perceptron." *Proceedings of Symposium Informatica*, 3-121-5.
2. Pan, S. J., & Yang, Q. (2009). "A survey on transfer learning." *IEEE Transactions on Knowledge and Data Engineering*, 22(10), 1345-1359.
3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." *Advances in Neural Information Processing Systems*, 25.
4. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). "How transferable are features in deep neural networks?" *Advances in Neural Information Processing Systems*, 27. arXiv:1411.1792.
5. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv preprint arXiv:1810.04805*.
6. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI.
7. Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33. arXiv:2005.14165.
8. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. arXiv:2111.06377.
9. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning*. arXiv:2103.00020.
10. Hu, E. J., Shen, Y., Wallis, P., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*.
11. Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models." *Advances in Neural Information Processing Systems*, 35. arXiv:2203.15556.
12. Bommasani, R., et al. (2021). "On the Opportunities and Risks of Foundation Models." *arXiv preprint arXiv:2108.07258*. Stanford CRFM.
13. Han, X., et al. (2021). "Pre-trained models: Past, present and future." *AI Open*, 2, 225-250.
14. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." *arXiv preprint arXiv:1910.01108*.
15. Hugging Face (2025). "Hugging Face Hub documentation" and Hub model statistics (the Hub surpassed 2 million hosted models in 2025). huggingface.co/docs/hub.

