# Transfer Learning

> Source: https://aiwiki.ai/wiki/transfer_learning
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Transfer learning** is a [machine learning](/wiki/machine_learning) technique that reuses knowledge a model has gained on one task or domain to improve performance on a different but related task or domain, instead of training a new model from scratch each time. By leveraging representations, parameters, or features learned from large datasets, transfer learning reduces the data, time, and compute needed to reach strong results [15]. It is the dominant paradigm in modern [deep learning](/wiki/deep_learning), underpinning breakthroughs in [computer vision](/wiki/computer_vision), [natural language processing](/wiki/natural_language_processing), speech recognition, and many other fields. The foundational reference, Pan and Yang's 2010 "A Survey on Transfer Learning," is one of the most cited works in the field, with more than 20,000 citations [1]. Results produced by the paradigm include [BERT](/wiki/bert) pushing the GLUE benchmark score to 80.5% in 2018 [10] and [CLIP](/wiki/clip) matching a fully supervised ResNet-50 on [ImageNet](/wiki/imagenet) using none of its 1.28 million labeled training images [20].

Training a large [neural network](/wiki/neural_network) on a broad dataset and then adapting it to a specialized downstream task is now standard in both computer vision and natural language processing. The approach is especially valuable when labeled data for the target task is scarce or expensive to collect, because a pre-trained model already captures useful statistical regularities from the source data.

## ELI5 (Explain like I'm 5)

Imagine you already know how to ride a bicycle. When you try to learn how to ride a motorcycle, you do not have to start from zero because you already understand balance, steering, and braking. The skills you learned on the bicycle "transfer" to the motorcycle, making it much easier and faster to learn. Transfer learning works the same way for computers. A computer that has already learned to recognize thousands of objects in photos can use that knowledge to quickly learn a new task, like identifying specific bird species, without needing millions of new training examples.

## Historical background

### Early foundations

The intellectual roots of transfer learning trace back to research on inductive transfer, inductive bias, and "learning to learn" in the early 1990s. In 1992, Lorien Pratt formulated the discriminability-based transfer (DBT) algorithm, one of the first explicit algorithms for transferring knowledge between neural networks trained on different tasks. Pratt published further early work on transfer between [neural network](/wiki/neural_network) tasks in 1993. The topic gained wider attention at the NIPS 1995 workshop titled "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems," held in Vail, Colorado and organized by Sebastian Thrun and others. This workshop brought together researchers interested in how a learner could exploit experience on previous tasks to improve performance on new ones.

Rich Caruana's 1997 paper on multi-task learning was another foundational contribution. Caruana framed MTL as "an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias," and demonstrated that training a neural network on multiple related tasks simultaneously, using shared hidden-layer representations, could improve generalization on each individual task [3]. This work established the principle that shared representations capture useful inductive biases. The following year, Sebastian Thrun and Lorien Pratt edited the book *Learning to Learn* (1998), which collected foundational work on the subject and helped crystallize transfer learning as a recognized research direction within machine learning [2].

### The Pan and Yang survey (2010)

A pivotal moment in the field came with the publication of "A Survey on Transfer Learning" by Sinno Jialin Pan and Qiang Yang in *IEEE Transactions on Knowledge and Data Engineering* (2010) [1]. This survey provided a comprehensive taxonomy of transfer learning settings for classification, regression, and clustering, classifying transfer learning into inductive, transductive, and unsupervised categories [1]. Pan and Yang formalized the distinction between different transfer learning scenarios based on differences in domains and tasks between source and target, and they discussed the relationship between transfer learning and related topics such as domain adaptation, multi-task learning, sample selection bias, and covariate shift [1]. The paper became one of the most cited works in the entire field of machine learning, accumulating more than 20,000 citations, and it remains a standard reference for researchers entering the area [1].

### The deep learning era

The deep learning revolution that began around 2012, catalyzed by the success of AlexNet on the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge (ILSVRC), made transfer learning practical at scale. Once researchers demonstrated that deep [convolutional neural networks](/wiki/convolutional_neural_network) trained on ImageNet learned general-purpose visual features, the practice of pre-training on ImageNet and fine-tuning on downstream tasks quickly became the default approach in computer vision. Jason Yosinski and colleagues published an influential 2014 study quantifying how transferable different neural network layers are, finding that early layers learn general features (edges, textures) while later layers learn task-specific features [4].

In natural language processing, transfer learning took longer to mature. Word embedding methods like Word2Vec (2013) and GloVe (2014) represented an early form of transfer, providing pre-trained word representations. The real transformation came in 2018 with three landmark contributions: [ELMo](/wiki/elmo) (Peters et al.), which introduced contextualized word embeddings [7]; ULMFiT (Howard and Ruder), which demonstrated effective language model [fine-tuning](/wiki/fine_tuning) for text classification [8]; and [BERT](/wiki/bert) (Devlin et al.), which established pre-training and fine-tuning as the standard NLP paradigm [10]. [GPT](/wiki/gpt) (Radford et al., 2018) and its successors further demonstrated the power of large-scale language model pre-training [9].

## Formal definition

Following the framework of Pan and Yang (2010), transfer learning can be defined more precisely using the concepts of domain and task.

A **domain** D consists of a feature space 𝒳 and a marginal probability distribution P(X), where X = {x_1, x_2, ..., x_n} is a sample of instances drawn from 𝒳. A **task** T, given a domain D, consists of a label space 𝒴 and a predictive function f(.) that is learned from training data consisting of pairs (x_i, y_i), with x_i in 𝒳 and y_i in 𝒴. From a probabilistic viewpoint, f(x) corresponds to the conditional distribution P(y|x).

Given a source domain D_S with a corresponding source task T_S, and a target domain D_T with a corresponding target task T_T, transfer learning aims to improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where D_S is not equal to D_T or T_S is not equal to T_T [1]. In other words, transfer learning applies whenever the source and target differ in their feature spaces, their marginal distributions, their label spaces, or their conditional distributions.

## What are the main types of transfer learning?

Pan and Yang's 2010 taxonomy divides transfer learning into three main categories based on the relationship between source and target domains and tasks [1].

| Type | Source labels | Target labels | Domain relationship | Task relationship | Typical methods |
|---|---|---|---|---|---|
| Inductive | Available or not | Required | Same or different | Different | Multi-task learning, self-taught learning, fine-tuning |
| Transductive | Available | Not available | Different | Same | Domain adaptation, sample selection bias correction |
| Unsupervised | Not available | Not available | Different | Different but related | Transfer clustering, transfer dimensionality reduction |
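
The table can be read as a small decision rule. The following is a minimal sketch of that rule, not part of Pan and Yang's survey; the function name `transfer_setting` and its boolean arguments are hypothetical, and combinations the table does not list are reported as such rather than forced into a category.

```python
def transfer_setting(source_labeled: bool, target_labeled: bool, same_task: bool) -> str:
    """Map label availability and task relationship to a row of the table."""
    if target_labeled and not same_task:
        # Labeled target data plus a different task: inductive,
        # whether or not the source has labels.
        return "inductive"
    if source_labeled and not target_labeled and same_task:
        # Labels exist only in the source and the task is shared.
        return "transductive"
    if not source_labeled and not target_labeled and not same_task:
        # No labels anywhere; the tasks are different but related.
        return "unsupervised"
    return "not covered by the table"


print(transfer_setting(source_labeled=True, target_labeled=False, same_task=True))
```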

### Inductive transfer learning

In inductive transfer learning, the target task differs from the source task regardless of whether the domains are the same or different. Some labeled data in the target domain is required. The primary goal is to improve the performance of the model on the target task by utilizing the knowledge gained from the source task. There are two sub-cases:

- **With labeled source data:** The source domain has abundant labeled data. The model learns a shared representation from the source and adapts it to the target. This setting resembles [multi-task learning](/wiki/multi-task_learning) when both tasks are learned simultaneously.
- **Without labeled source data (self-taught learning):** The source domain has only unlabeled data. The model learns useful representations (for example, through unsupervised pre-training) that transfer to the target task.

Modern fine-tuning of [pre-trained model](/wiki/pre-trained_model) weights on a new task is the most common example of inductive transfer.

### Transductive transfer learning

In transductive transfer learning, the source and target tasks are the same, but the domains differ. Labeled data exists only in the source domain. This setting encompasses [domain adaptation](/wiki/domain_adaptation) (where marginal probability distributions differ between source and target) and cross-domain transfer. For example, a sentiment classifier trained on electronics product reviews might be adapted to work on book reviews or movie reviews, where the vocabulary and expression patterns are different but the underlying task is the same.

### Unsupervised transfer learning

Unsupervised transfer learning addresses scenarios where no labeled data is available in either domain. The focus is on unsupervised tasks such as clustering or dimensionality reduction, and knowledge from a source domain's unlabeled data is transferred to help with an unsupervised learning task in a target domain. This approach often involves unsupervised [feature extraction](/wiki/feature_extraction) and unsupervised domain adaptation techniques.

## How does feature extraction differ from fine-tuning?

Two primary strategies dominate practical transfer learning: [feature extraction](/wiki/feature_extraction) and fine-tuning. These strategies differ in how much of the pre-trained model is adapted to the new task.

### Feature extraction

In the feature extraction approach, the pre-trained model serves as a fixed feature extractor (a fixed encoder). The model's weights are frozen (not updated during training), and its intermediate or final layer outputs (activations) are used as input features for a new classifier or regression model trained on the target task. This approach works well when the target dataset is small and the source domain is sufficiently similar to the target domain. Because the pre-trained weights are never modified, there is minimal risk of [overfitting](/wiki/overfitting) to a small target dataset, and only the new classifier's parameters are updated during training. However, because the features are fixed, the model cannot adapt its internal representations to the specific nuances of the target task.

For example, a ResNet-50 model pre-trained on ImageNet can have its final classification layer removed, and the 2048-dimensional output of its penultimate layer can serve as a rich feature vector for a new [image recognition](/wiki/image_recognition) task. A simple linear classifier trained on these features often achieves surprisingly strong performance.
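
The mechanics can be illustrated without a real network. The sketch below is a toy under stated assumptions: `FROZEN_W` is a hypothetical stand-in for pre-trained backbone weights (three features instead of 2048), the data are invented, and the head is a plain logistic regression trained by gradient descent. Only the head's parameters change during training.

```python
import math

# Hypothetical stand-in for a pre-trained backbone: fixed weights, never updated.
FROZEN_W = ((1.0, -1.0), (0.5, 0.5), (-1.0, 1.0))


def extract_features(x):
    """Frozen encoder: ReLU(W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed features with per-example updates."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(x, w, b):
    f = extract_features(x)
    return int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5)


# Invented target task: label 1 when the first input exceeds the second.
inputs = [(2.0, 0.0), (1.5, 0.5), (3.0, 1.0), (0.0, 2.0), (0.5, 1.5), (1.0, 3.0)]
labels = [1, 1, 1, 0, 0, 0]

features = [extract_features(x) for x in inputs]  # computed once, then reused
head_w, head_b = train_head(features, labels)
predictions = [predict(x, head_w, head_b) for x in inputs]
```

Because the encoder is frozen, the features can be computed once and cached, which is one reason this strategy is cheap compared with fine-tuning.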

### Fine-tuning

[Fine-tuning](/wiki/fine_tuning) involves initializing a model with pre-trained weights and then continuing training (with a typically smaller learning rate) on the target dataset. Depending on the amount of target data and the similarity between source and target domains, practitioners may fine-tune all layers of the model or only a subset. This allows the model to adapt its learned representations to the specific characteristics of the new task and domain. Fine-tuning generally outperforms feature extraction when sufficient target data is available and when the source and target domains are somewhat different. However, it carries a higher risk of overfitting, especially when the target dataset is very small. Careful learning rate scheduling, regularization, and strategies like progressive unfreezing help mitigate this risk.

A common strategy is to freeze the early layers of the network (which tend to learn general features like edges and textures) and only fine-tune the later layers (which learn more abstract, task-specific features). Progressive unfreezing, where layers are gradually unfrozen from top to bottom during training, is another effective technique that helps prevent catastrophic forgetting of previously learned representations.

A landmark study by Yosinski et al. (2014), "How Transferable Are Features in Deep Neural Networks?," systematically investigated the generality and specificity of features at different layers of a deep network [4]. The authors found that early layers learn general features (such as edge detectors and color blobs) that are broadly useful across tasks, while later layers learn increasingly task-specific features [4]. They also demonstrated that initializing a network with transferred features from almost any number of layers produces a boost to generalization, even after fine-tuning on the target dataset [4]. This work provided foundational empirical evidence for the effectiveness of both feature extraction and fine-tuning strategies.

### Comparison of fine-tuning strategies

| Strategy | Weights Updated | Compute Cost | Best When | Risk |
|---|---|---|---|---|
| Feature extraction | Only new classifier head | Low | Small target dataset, similar domains | Under-fitting on dissimilar targets |
| Full fine-tuning | All layers | High | Large target dataset, dissimilar domains | Overfitting on small datasets |
| Partial fine-tuning | Top N layers + classifier | Medium | Moderate target data | Under-adaptation if too few layers are unfrozen; N needs tuning |
| Progressive unfreezing | Layers unfrozen gradually | Medium-High | Medium target data, risk of catastrophic forgetting | Requires tuning of schedule |

### When to use each approach

| Scenario | Target data size | Domain similarity | Recommended strategy |
|---|---|---|---|
| Small dataset, similar domain | Small (hundreds) | High | Feature extraction |
| Small dataset, different domain | Small (hundreds) | Low | Feature extraction from early layers, or fine-tune with caution |
| Large dataset, similar domain | Large (thousands+) | High | Fine-tune all layers |
| Large dataset, different domain | Large (thousands+) | Low | Fine-tune all layers, or train from scratch |
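
The table amounts to a two-question heuristic. A minimal sketch follows; the function name `recommend_strategy` and the cutoff of 1,000 labeled examples are assumptions chosen to match the "hundreds" versus "thousands+" wording, not thresholds taken from any source.

```python
def recommend_strategy(n_labeled: int, similar_domain: bool, small_cutoff: int = 1000) -> str:
    """Rule of thumb only; small_cutoff is an assumed boundary between small and large."""
    small = n_labeled < small_cutoff
    if small and similar_domain:
        return "feature extraction"
    if small:
        return "feature extraction from early layers, or cautious fine-tuning"
    if similar_domain:
        return "fine-tune all layers"
    return "fine-tune all layers, or train from scratch"


print(recommend_strategy(n_labeled=300, similar_domain=True))
```

In practice the boundary is not sharp, so a heuristic like this is a starting point for experiments rather than a substitute for validation on held-out target data.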

## Transfer learning in computer vision

Computer vision was the first field where deep transfer learning became standard practice. The [ImageNet](/wiki/imagenet) dataset, containing over 1.2 million images across 1,000 categories (and over 14 million images across more than 20,000 categories in its broader form), served as the foundation for training models whose representations proved broadly useful.

### ImageNet pre-training

After [AlexNet](/wiki/alexnet) won the ILSVRC 2012 competition, researchers quickly discovered that the features learned by deep CNNs on ImageNet were remarkably general and could be repurposed for tasks far removed from ImageNet classification. Two influential 2014 papers cemented this finding:

- **DeCAF** (Donahue et al., 2014): Evaluated deep convolutional activation features trained on ImageNet for novel recognition tasks including scene recognition, domain adaptation, and fine-grained recognition [5]. The work demonstrated that features from intermediate layers of a CNN trained on ImageNet transferred effectively to a wide variety of visual tasks [5].
- **CNN Features Off-the-Shelf** (Razavian et al., 2014): Showed that features extracted from a pre-trained CNN, with no task-specific modification, provided an "astounding baseline" for a broad range of recognition problems [6]. This paper popularized the practice of using pre-trained CNN features as a generic starting point for vision tasks [6].

Today, pre-training on ImageNet (or larger datasets like ImageNet-21k, [LAION](/wiki/laion)-5B, or JFT-300M) remains the standard initialization strategy for vision models.

### Backbone freezing

Backbone freezing is the practice of keeping the convolutional layers (the "backbone") of a pre-trained model fixed and training only a new task-specific head (for example, a fully connected classification layer). The rationale comes from the hierarchical nature of feature learning in CNNs: early layers capture universal visual features like edges and textures, while later layers develop task-specific representations.

Freezing the backbone preserves these general-purpose features, reduces the risk of overfitting on small target datasets, and lowers GPU memory use because no gradients or optimizer state need to be stored for the frozen layers (reductions of up to 28% compared to full fine-tuning have been reported). It is especially effective when the target dataset is small and visually similar to the source domain.

### Progressive unfreezing

Progressive unfreezing is a training strategy in which layers of a pre-trained model are gradually unfrozen from the top (closest to the output) to the bottom (closest to the input) over the course of training. The approach was popularized by Howard and Ruder (2018) in the context of NLP (see ULMFiT below), but it is equally applicable to vision models [8].

The procedure typically follows these steps:

1. Freeze the entire pre-trained backbone and train only the new task-specific head for several epochs.
2. Unfreeze the top layer group of the backbone and continue training with a reduced learning rate.
3. Progressively unfreeze additional layer groups, each time lowering the learning rate further.
4. Optionally, fine-tune the entire model end-to-end with a very small learning rate.
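
The steps above can be sketched as a schedule that records, for each epoch, which parts of the model are trainable and at what learning rate. This is an illustrative sketch rather than a reference implementation: the function name, the group labels, and the choice to halve the learning rate at each stage (`lr_decay=0.5`) are all assumptions.

```python
def unfreezing_schedule(n_groups, head_epochs, epochs_per_stage, base_lr, lr_decay=0.5):
    """Return a list of (epoch, trainable_parts, learning_rate) tuples.

    Backbone groups are indexed 0 (closest to the input) to n_groups - 1
    (closest to the output); "head" is the new task-specific layer.
    """
    schedule = []
    epoch = 0
    # Step 1: backbone fully frozen, only the head trains.
    for _ in range(head_epochs):
        schedule.append((epoch, ["head"], base_lr))
        epoch += 1
    # Steps 2-4: unfreeze one more group from the top, lowering the rate each time.
    trainable = ["head"]
    lr = base_lr
    for group in range(n_groups - 1, -1, -1):
        trainable = trainable + [f"group{group}"]
        lr *= lr_decay
        for _ in range(epochs_per_stage):
            schedule.append((epoch, list(trainable), lr))
            epoch += 1
    return schedule


for epoch, parts, lr in unfreezing_schedule(n_groups=3, head_epochs=2, epochs_per_stage=1, base_lr=0.1):
    print(epoch, parts, lr)
```

In a deep learning framework, each entry would translate into enabling gradients for the listed parts and setting the optimizer's learning rate before that epoch.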

Progressive unfreezing helps prevent catastrophic forgetting (the phenomenon where fine-tuning destroys previously learned features). When training stops before the whole backbone is unfrozen, it can also save compute, memory, and training time, often at a small accuracy cost (frequently less than 1%) relative to full fine-tuning.

### Common pre-trained CNN architectures

Several architectures trained on ImageNet became standard starting points for transfer learning:

| Architecture | Year | Pre-Training Data | Key Feature / Innovation | Parameters (approx.) | Typical Transfer Use |
|---|---|---|---|---|---|
| [AlexNet](/wiki/alexnet) | 2012 | [ImageNet](/wiki/imagenet) | First large-scale deep CNN | ~60M | Historical; rarely used today |
| [VGGNet](/wiki/vgg) (VGG-16/19) | 2014 | [ImageNet](/wiki/imagenet) | Deep stacking of 3x3 convolutions; simple, uniform architecture | 138M | Feature extraction baseline |
| [GoogLeNet / Inception](/wiki/inception) | 2014 | [ImageNet](/wiki/imagenet) | Inception modules with parallel convolution paths | 6.8M | Efficient feature extraction |
| [ResNet](/wiki/resnet) (50/101/152) | 2015 | [ImageNet](/wiki/imagenet) | Residual connections enabling very deep networks | 25M-60M | Widely used backbone |
| DenseNet | 2017 | [ImageNet](/wiki/imagenet) | Dense connectivity between layers | 8M-20M | Strong classifier backbone |
| [EfficientNet](/wiki/efficientnet) | 2019 | [ImageNet](/wiki/imagenet) | Compound scaling of depth, width, resolution | 5M-66M | High accuracy, efficient |
| [Vision Transformer (ViT)](/wiki/vision_transformer) | 2020 | [ImageNet](/wiki/imagenet) / JFT-300M | Transformer applied to image patches | 86M-632M | Strong with large data |
| [Swin Transformer](/wiki/swin_transformer) | 2021 | [ImageNet](/wiki/imagenet) | Shifted window attention | Varies | State-of-the-art backbone |
| [ConvNeXt](/wiki/convnext) | 2022 | [ImageNet](/wiki/imagenet) | Modernized ConvNet | Varies | Competitive with transformers |

Yosinski et al. (2014) showed that the first layers of a deep CNN learn general visual features (Gabor-like filters, color blobs) that transfer well across tasks, while higher layers become progressively more task-specific [4]. This finding provided empirical justification for the practice of freezing early layers while fine-tuning later ones.

### Practical workflow

A typical transfer learning workflow in computer vision involves: (1) selecting a pre-trained model appropriate for the task complexity and available compute; (2) replacing the final classification head with a new head matching the number of target classes; (3) optionally freezing early layers; (4) training on the target dataset with a reduced learning rate (often 10 to 100 times smaller than the original training rate); and (5) evaluating and optionally unfreezing additional layers for further fine-tuning.

Smaller models like MobileNet can match or even outperform larger architectures such as ResNet-152 in low-data regimes, because their lower capacity makes them less prone to overfitting.

## Transfer learning in natural language processing

Transfer learning transformed NLP beginning in 2018, a year sometimes called the "ImageNet moment" for language. The field shifted from task-specific architectures to a general "pre-train then fine-tune" paradigm.

### Word embeddings as early transfer

Before the era of pre-trained language models, word [embeddings](/wiki/embeddings) such as [Word2Vec](/wiki/word2vec) (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) represented a rudimentary form of transfer learning. These embeddings were trained on large unlabeled corpora and then used as input features for downstream NLP models. However, word embeddings are static (each word receives a single vector regardless of context), which limits their transferability.

### ELMo (2018)

[ELMo](/wiki/elmo) (Embeddings from Language Models), introduced by Peters et al. (2018), represented a major step forward [7]. ELMo produces deep contextualized word representations by running a deep bidirectional [LSTM](/wiki/lstm) language model on text and combining the internal states of all layers [7]. Unlike static word embeddings, ELMo representations change depending on the surrounding context, capturing polysemy and syntactic roles. Adding ELMo representations to existing NLP models improved the state of the art across six challenging tasks, including question answering, textual entailment, and sentiment analysis [7].
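
The layer-combination step can be written out in a few lines. The sketch below follows the formulation in the ELMo paper, where a token's representation is a scaled, softmax-weighted sum of its per-layer vectors; the function name `scalar_mix` and the toy vectors are illustrative, and in a real model the logits and `gamma` are learned for each downstream task.

```python
import math


def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Combine one token's per-layer vectors: gamma * sum_j softmax(logits)_j * h_j."""
    top = max(layer_logits)
    exps = [math.exp(a - top) for a in layer_logits]  # shifted for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states)) for d in range(dim)]


# Toy example: three layers, two-dimensional states, equal layer weights.
token_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(scalar_mix(token_layers, layer_logits=[0.0, 0.0, 0.0]))
```

Letting the downstream task learn the weights means it can lean on lower layers for syntactic cues or higher layers for more semantic ones.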

### ULMFiT (2018): a pivotal method

Universal Language Model Fine-tuning (ULMFiT), proposed by Jeremy Howard and Sebastian Ruder (2018), was one of the first methods to demonstrate that a single general-purpose pre-trained language model could be effectively fine-tuned for any text classification task [8]. ULMFiT introduced three key techniques that became standard practice:

1. **Discriminative fine-tuning.** Different layers of the language model are tuned with different learning rates. Earlier layers, which capture general linguistic features, receive smaller learning rates, while later layers, which capture more task-specific features, receive larger ones.
2. **Slanted triangular learning rates.** The learning rate first increases linearly for a short period to help the model quickly converge to a suitable parameter region for the target task, then decays linearly over a longer period for gradual refinement.
3. **Gradual unfreezing.** Rather than fine-tuning all layers simultaneously, layers are unfrozen one at a time from the last layer backward. This prevents catastrophic forgetting of the general knowledge captured in earlier layers.
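
The first two techniques reduce to short formulas. The sketch below follows the definitions in the ULMFiT paper; the default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`, and a per-layer factor of 2.6) are the ones the paper reports using and are not stated elsewhere in this article, and the function names are illustrative.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t: short linear warm-up, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)  # iteration where the rate peaks
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    # ratio controls how much smaller the lowest rate is than lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Per-layer rates, earliest layer first; each earlier layer's rate is divided by factor."""
    return [lr_last / factor ** (n_layers - 1 - layer) for layer in range(n_layers)]


print([round(slanted_triangular_lr(t, 1000), 6) for t in (0, 50, 100, 550, 1000)])
print(discriminative_lrs(0.01, n_layers=3))
```

With these defaults the rate rises for the first 10% of iterations and decays for the remaining 90%, while each earlier layer trains more slowly than the one above it.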

The authors reported that ULMFiT "significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets" [8]. They further showed that with only 100 labeled examples, ULMFiT matched the performance of training from scratch on 100 times more data [8].

### GPT (2018)

[GPT](/wiki/gpt) (Generative Pre-trained Transformer), introduced by Radford et al. at [OpenAI](/wiki/openai), applied the pre-train-then-fine-tune paradigm to the [Transformer](/wiki/transformer) architecture [9]. GPT was pre-trained as a unidirectional (left-to-right) language model on the BooksCorpus dataset and then fine-tuned on downstream tasks with minimal architectural changes [9]. GPT demonstrated that Transformer-based language models, when pre-trained at sufficient scale, could transfer effectively to tasks including textual entailment, question answering, and semantic similarity [9].

### BERT and the pre-train/fine-tune paradigm

[BERT](/wiki/bert) (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google, extended the pre-train-then-fine-tune paradigm by using a masked language model objective that considers both left and right context simultaneously [10]. BERT was pre-trained on large text corpora with masked language modeling and next sentence prediction objectives, yielding bidirectional representations that capture rich contextual information [10]. Fine-tuning BERT on a target task typically required only adding a task-specific output layer and training for a few epochs [10]. BERT obtained "new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement)" upon its release, and popularized the pattern of releasing pre-trained models that the community could fine-tune for specific tasks [10].

BERT's success spawned numerous variants (such as [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert), DistilBERT, [ELECTRA](/wiki/electra), and [DeBERTa](/wiki/deberta)) that improved upon the original through changes to training procedures, model architecture, or efficiency.

### Scaling up: GPT-2, GPT-3, and beyond

Subsequent work by OpenAI scaled up the GPT architecture. [GPT-2](/wiki/gpt) (2019) demonstrated that increasing model size and training data led to emergent capabilities in zero-shot task performance. [GPT-3](/wiki/gpt-3) (2020), with 175 billion parameters, showed that sufficiently large language models could perform tasks through in-context learning (providing task examples in the prompt) without any gradient-based fine-tuning [11]. The GPT-3 paper reported that "scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches" [11]. This represented a new form of transfer where the model's pre-trained knowledge was activated through prompting rather than parameter updates.
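
In-context learning moves the "training set" into the input text. A minimal sketch of assembling such a prompt is shown below; the `Input:`/`Output:` template and the function name are hypothetical choices for illustration, not a format prescribed by the GPT-3 paper, and the call to an actual language model is omitted.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from a task description, worked examples, and a new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    [("great film", "positive"), ("dull plot", "negative")],
    "loved it",
)
print(prompt)
```

Passing an empty list of examples turns the same template into a zero-shot prompt. In either case no model parameters are updated.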

### NLP transfer learning timeline

| Model | Year | Architecture | Pre-Training Objective | Transfer Method / Key Contribution |
|---|---|---|---|---|
| [Word2Vec](/wiki/word2vec) | 2013 | Shallow network | Skip-gram / CBOW | Static embeddings as input features |
| GloVe | 2014 | Matrix factorization | Co-occurrence statistics | Static embeddings as input features |
| [ELMo](/wiki/elmo) | 2018 | Bidirectional [LSTM](/wiki/lstm) | Language modeling | Contextualized embeddings added to task model |
| ULMFiT | 2018 | AWD-LSTM | Language modeling | Full model fine-tuning with gradual unfreezing |
| [GPT](/wiki/gpt) | 2018 | [Transformer](/wiki/transformer) (decoder) | Autoregressive LM | Fine-tuning with task-specific head |
| [BERT](/wiki/bert) | 2018 | [Transformer](/wiki/transformer) (encoder) | Masked LM + next sentence prediction | Dominant pre-training approach for NLU tasks |
| GPT-2 / [GPT-3](/wiki/gpt-3) | 2019-2020 | [Transformer](/wiki/transformer) (decoder) | Autoregressive LM | Emergent few-shot, zero-shot, in-context learning |
| [T5](/wiki/t5) | 2019-2020 | [Transformer](/wiki/transformer) (encoder-decoder) | Text-to-text denoising | Unified diverse NLP tasks under a single format |

## Domain adaptation

[Domain adaptation](/wiki/domain_adaptation) is a specific form of transductive transfer learning that addresses the problem of training and test data coming from different distributions while sharing the same task. This is one of the most practically important transfer learning scenarios, as real-world deployment conditions frequently differ from training conditions. Common examples include adapting a model trained on synthetic images to work on real photographs, or adapting an NLP model trained on news text to process social media posts.

### Types of domain adaptation

- **Supervised domain adaptation:** Some labeled target data is available alongside labeled source data. The goal is to combine both data sources to achieve better target performance than using either alone.
- **Semi-supervised domain adaptation:** A small amount of labeled target data is available along with a large amount of unlabeled target data.
- **Unsupervised domain adaptation:** This is the most challenging and most commonly studied setting, where labeled data is available only in the source domain and only unlabeled data is available in the target domain.

### Core techniques

Several families of techniques address domain adaptation:

| Technique | Description | Example Methods |
|---|---|---|
| Instance reweighting | Assigns different weights to source samples to approximate the target distribution | Importance weighting, Kernel Mean Matching |
| Feature alignment | Maps source and target features into a shared representation space | Maximum Mean Discrepancy (MMD), Correlation Alignment (CORAL) |
| Adversarial adaptation | Uses a domain discriminator to encourage domain-invariant features | Domain-Adversarial Neural Networks (DANN), Adversarial Discriminative Domain Adaptation (ADDA) |
| Batch normalization adaptation | Replaces source-domain batch normalization statistics with statistics estimated on the target domain | Adaptive Batch Normalization (AdaBN) |
| Self-training / pseudo-labeling | Uses model predictions on unlabeled target data as pseudo-labels for further training | Noisy Student Training |

Adversarial approaches, inspired by generative adversarial networks, train a domain discriminator to distinguish source from target features while the feature extractor is trained to fool the discriminator [16]. This encourages domain-invariant representations. Domain-Adversarial Neural Networks (DANN), introduced by Ganin et al. (2016), are a representative example [16].
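
Feature alignment methods need a number that says how far apart two feature distributions are. As a minimal illustration, the sketch below computes squared Maximum Mean Discrepancy with a linear kernel, which reduces to the squared distance between the mean feature vectors of the two domains. The feature values are invented, and practical methods usually use richer kernels and minimize this quantity as a training loss.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def linear_mmd2(source_feats, target_feats):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


source = [[0.0, 0.0], [2.0, 2.0]]
target = [[3.0, 1.0], [5.0, 3.0]]            # same shape of data, shifted by (3, 1)
aligned = [[x - 3.0, y - 1.0] for x, y in target]

print(linear_mmd2(source, target))   # large: the domains are misaligned
print(linear_mmd2(source, aligned))  # zero once the shift is removed
```

A linear kernel only detects differences in means; kernels such as the Gaussian kernel are sensitive to higher-order differences between the distributions.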

## Few-shot and zero-shot learning

[Few-shot learning](/wiki/few-shot_learning) and [zero-shot learning](/wiki/zero-shot_learning) represent extreme forms of transfer, where models must generalize to new tasks or classes with minimal or no task-specific training examples.

### Zero-shot learning

In zero-shot learning, the model must generalize to classes or tasks it has never seen during training. This is made possible by transferring knowledge through shared semantic representations, attributes, or natural language descriptions. For example, a vision model trained to recognize horses and stripes can recognize a zebra by combining these concepts, even if it has never seen a zebra during training. OpenAI's [CLIP](/wiki/clip) model (Radford et al., 2021) demonstrated that a model trained on 400 million image-text pairs could classify images into arbitrary categories described in natural language without any task-specific training data [20]. CLIP achieved 76.2% zero-shot top-1 accuracy on [ImageNet](/wiki/imagenet), matching the performance of the original supervised ResNet-50 while using none of the 1.28 million labeled training examples that model required [20]. Similarly, GPT-3 showed that sufficiently large language models can perform new tasks given only a natural language description of the task [11]. In the context of [large language models](/wiki/large_language_model), zero-shot learning takes the form of prompting: the model uses its pre-trained knowledge to perform a task described in natural language without any task-specific examples.
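
The classification step in a CLIP-style model is a nearest-neighbor lookup in a shared embedding space. The sketch below shows only that step, under loud assumptions: the three-dimensional embeddings are hand-made for illustration, whereas a real system would obtain them from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the class description whose embedding is most similar to the image's."""
    return max(text_embeddings, key=lambda label: cosine(image_embedding, text_embeddings[label]))


# Hypothetical embeddings of natural-language class descriptions.
text_embeddings = {
    "a photo of a horse": [0.9, 0.1, 0.0],
    "a photo of a zebra": [0.7, 0.7, 0.1],
    "a photo of a tiger": [0.0, 0.8, 0.6],
}
image_embedding = [0.6, 0.75, 0.05]  # hypothetical embedding of an unseen image

predicted = zero_shot_classify(image_embedding, text_embeddings)
print(predicted)
```

Adding a new class requires only a new text description and its embedding, which is why no task-specific training data is needed.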

### Few-shot learning

Few-shot learning involves adapting to new tasks using only a handful of labeled examples (typically 1 to 10 per class). Common approaches include:

- **Metric learning:** Learning an embedding space where examples from the same class are close together (e.g., Prototypical Networks, Matching Networks).
- **[Meta-learning](/wiki/meta-learning):** Training the model to learn how to learn, so it can rapidly adapt to new tasks from few examples. Model-Agnostic Meta-Learning (MAML) exposes models to many small learning tasks during training to make them easily adaptable.
- **In-context learning:** Providing a few examples in the input prompt of a large language model, which then generalizes to new instances without gradient updates (e.g., [GPT-3](/wiki/gpt-3)) [11].
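
As an illustration of the metric-learning idea, the sketch below classifies a query in the style of Prototypical Networks: each class is represented by the mean of its support embeddings, and the query is assigned to the nearest prototype. The two-dimensional embeddings and class names are invented for the example, and the embedding function itself (normally a trained network) is assumed to exist elsewhere.

```python
import math


def prototype(vectors: list[list[float]]) -> list[float]:
    """Mean of the support embeddings for one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(query: list[float], support: dict[str, list[list[float]]]) -> str:
    """Assign the query to the class whose prototype is closest."""
    prototypes = {label: prototype(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))


# Three made-up support embeddings per class (a "3-shot" episode).
support = {
    "sparrow": [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]],
    "finch": [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]],
}

print(classify([0.9, 0.3], support))
print(classify([0.1, 0.7], support))
```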

These capabilities arise because large-scale pre-training on diverse data allows models to develop general-purpose representations that capture broad knowledge about the world, which can then be directed toward specific tasks through prompts, instructions, or minimal examples.
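
Directing a pre-trained language model with a few examples amounts to formatting those examples into the prompt. The helper below is a hypothetical sketch of that formatting step for a sentiment task. The field names `Review:` and `Sentiment:` are arbitrary choices, and the call to an actual model is deliberately left out.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Format an instruction, k labeled examples, and one unlabeled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final entry is left unlabeled for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "A charming film with a weak ending.",
)
print(prompt)
```

Passing an empty `examples` list yields the zero-shot form of the same prompt, which makes the relationship between the two settings explicit.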

## What is negative transfer?

Transfer learning does not always improve performance. When the source and target are sufficiently dissimilar, transferring knowledge from the source can actually degrade performance on the target task. This phenomenon is known as **negative transfer**. Understanding and avoiding negative transfer is a critical practical concern.

### Causes of negative transfer

| Factor | Description | Example |
|---|---|---|
| Domain divergence | Source and target distributions are too dissimilar | Medical X-rays vs. satellite imagery |
| Task conflict | Source task objectives conflict with target task | Sentiment analysis model transferred to topic classification |
| Feature misalignment | Shared features have different meanings across domains | "Bank" meaning financial institution vs. river bank |
| Over-transfer | Too many source parameters are rigidly transferred | Freezing too many layers when domains differ substantially |
| Irrelevant source data | Source data contains patterns specific to the source that are irrelevant to the target | Model relies on spurious patterns and performs worse than training from scratch |
| Insufficient target data for adaptation | With very little target data, fine-tuning may overfit to noise | Tiny labeled target set cannot meaningfully adapt transferred features |
| Mismatched model capacity | Transferring from an overly complex model to a simple target (or vice versa) introduces optimization difficulties | Huge pre-trained backbone applied to a tiny niche task |

### Detecting and mitigating negative transfer

Researchers have developed several strategies to detect and mitigate negative transfer:

- **Source selection:** Carefully choose a source domain or task related to the target, measuring domain similarity with metrics like proxy A-distance or Maximum Mean Discrepancy.
- **Selective transfer:** Identify and transfer only the layers or features most likely to be beneficial, discarding or re-initializing others.
- **Regularization:** Apply regularization to balance between source knowledge and target adaptation.
- **Curriculum-based approaches:** Gradually increase the influence of source knowledge during training.
- **Multi-source transfer:** Combine knowledge from multiple diverse source domains to reduce the risk that any single mismatched source dominates.
- **Validation monitoring:** Monitor performance on a target validation set during training and stop transfer when performance degrades.
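
For source selection, one of the similarity measures named above, Maximum Mean Discrepancy (MMD), can be estimated directly from samples. The sketch below computes the biased estimate of squared MMD with an RBF kernel. The bandwidth parameter `gamma` and the toy two-dimensional features are assumptions made for illustration; in practice the inputs would typically be features extracted by a pre-trained model.

```python
import math


def rbf(u: list[float], v: list[float], gamma: float) -> float:
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_kernel(xs, ys, gamma: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(source, target, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two samples."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


target = [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.0]]
near_source = [[0.05, 0.0], [0.0, 0.1], [-0.05, 0.05]]
far_source = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.0]]

near = mmd_squared(near_source, target)
far = mmd_squared(far_source, target)
print(near < far)
```

A candidate source with a smaller estimate is distributionally closer to the target under this kernel, though a low MMD alone does not guarantee positive transfer.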

Wang et al. (2019) proposed a formal definition of negative transfer and an adversarial technique that avoids it by filtering out unrelated source data [18].

As a general guideline, transfer learning is most likely to help when the source and target share similar low-level features, when the target dataset is small relative to model capacity, and when the source model was trained on a large and diverse dataset [14].

## Multi-task learning and its relationship to transfer learning

[Multi-task learning](/wiki/multi-task_learning) (MTL) is closely related to transfer learning but differs in that all tasks are learned at the same time. While transfer learning typically proceeds in sequential stages (pre-train on source, then adapt to target), multi-task learning trains one model on multiple tasks simultaneously using shared representations.

Caruana (1997) described multi-task learning as "an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias" [3]. In a multi-task setup, related tasks share hidden layers in a neural network, and what is learned for one task can help other tasks be learned more effectively [3].
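
The shared-hidden-layer arrangement can be sketched with a tiny network: one shared layer feeds two task-specific heads, and training minimizes a weighted sum of the per-task losses. All weights, the squared-error losses, and the `weight_a`/`weight_b` parameters below are illustrative assumptions, not values from the cited work.

```python
def relu(v: list[float]) -> list[float]:
    return [max(0.0, x) for x in v]


def linear(weights: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product; each row of `weights` produces one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical parameters: one shared layer and one linear head per task.
shared_w = [[0.5, -0.2], [0.1, 0.4]]
head_a = [[1.0, -1.0]]
head_b = [[0.3, 0.7]]


def forward(x: list[float]) -> tuple[float, float]:
    hidden = relu(linear(shared_w, x))  # representation shared by both tasks
    return linear(head_a, hidden)[0], linear(head_b, hidden)[0]


def joint_loss(x, target_a, target_b, weight_a=1.0, weight_b=1.0) -> float:
    """Weighted sum of squared errors; gradients from both tasks reach shared_w."""
    pred_a, pred_b = forward(x)
    return weight_a * (pred_a - target_a) ** 2 + weight_b * (pred_b - target_b) ** 2


x = [1.0, 2.0]
pred_a, pred_b = forward(x)
print(pred_a, pred_b, joint_loss(x, target_a=0.0, target_b=1.0))
```

Because both loss terms depend on `shared_w`, each task's training signal acts as an inductive bias on the representation used by the other.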

The key differences between MTL and sequential transfer learning are:

- **Timing:** MTL trains on all tasks simultaneously; sequential transfer learning trains on the source task first and then adapts to the target task.
- **Objective:** MTL aims to improve performance on all tasks jointly; transfer learning typically prioritizes performance on the target task.
- **Data requirements:** MTL requires labeled data for all tasks during training; transfer learning may use unlabeled source data or a source task with different labels.

Despite these differences, the two approaches share the core principle that learning from related tasks provides useful inductive bias. In practice, many modern systems blend both: a model might be pre-trained with multiple auxiliary tasks (MTL) and then fine-tuned on a specific target task (sequential transfer). For example, [T5](/wiki/t5) is pre-trained with a multi-task mixture of unsupervised and supervised objectives and then fine-tuned on individual downstream tasks.

## Foundation models and parameter-efficient fine-tuning

The concept of [foundation models](/wiki/foundation_model), popularized by the Stanford Center for Research on Foundation Models (CRFM) in 2021, represents transfer learning taken to its logical extreme [12]. The CRFM report defined foundation models as "models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks" [12]. Examples include [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), [Gemini](/wiki/gemini), PaLM, LLaMA, [CLIP](/wiki/clip), DALL-E, and [Stable Diffusion](/wiki/stable_diffusion).

Foundation models differ from earlier transfer learning approaches in several ways:

- **Scale:** They are trained on vastly larger datasets (trillions of tokens for language models, billions of image-text pairs for multimodal models) using orders of magnitude more compute.
- **Generality:** A single foundation model can transfer to dozens or hundreds of distinct tasks across multiple modalities, whereas earlier transfer learning typically involved one source task and one target task.
- **Adaptation methods:** Beyond traditional fine-tuning, foundation models can be adapted through prompting, in-context learning, [reinforcement learning from human feedback (RLHF)](/wiki/rlhf), and parameter-efficient methods such as [LoRA](/wiki/lora), adapters, and prefix tuning.
- **Emergent capabilities:** At sufficient scale, foundation models exhibit capabilities (such as chain-of-thought reasoning or code generation) that were not explicitly targeted by their training objective, a form of transfer that was neither anticipated nor designed.

The total number of research papers related to foundation models has grown from fewer than 500 publications in 2020 to over 9,000 by 2025, reflecting the centrality of this scaled-up transfer learning paradigm in contemporary AI research.

### Parameter-efficient fine-tuning (PEFT)

As models have grown to billions or even trillions of parameters, full fine-tuning has become increasingly impractical. Retraining all parameters of a 175-billion-parameter model for each downstream task requires prohibitive memory and compute resources. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small fraction of the model's parameters while keeping the rest frozen.
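
A back-of-the-envelope calculation shows why full fine-tuning at this scale is costly. The sketch below assumes a common rule of thumb for mixed-precision training with Adam of roughly 16 bytes of state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance) and ignores activations. Real setups differ in precision, optimizer, and sharding, so the result should be read as an order-of-magnitude estimate rather than a measured figure.

```python
# Assumption: bytes of training state per parameter (mixed-precision Adam rule of thumb).
BYTES_PER_PARAM_FULL = 16
# Assumption: a frozen model stored in fp16 needs only its weights.
BYTES_PER_PARAM_FROZEN_FP16 = 2


def state_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory for weights plus any gradient/optimizer state, excluding activations."""
    return num_params * bytes_per_param


num_params = 175_000_000_000
full_ft = state_bytes(num_params, BYTES_PER_PARAM_FULL)
frozen = state_bytes(num_params, BYTES_PER_PARAM_FROZEN_FP16)

print(f"full fine-tuning state: {full_ft / 1e12:.2f} TB")
print(f"frozen fp16 weights:    {frozen / 1e9:.0f} GB")
```

Under these assumptions the trainable state is about eight times larger than the frozen weights alone, which is the gap PEFT methods exploit by keeping almost all parameters frozen.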

| Method | Year | Approach | Trainable params (typical) |
|---|---|---|---|
| Adapter modules | 2019 | Insert small trainable feed-forward bottleneck modules between frozen Transformer layers | 1-5% of total (often 2-4%) |
| Prefix tuning | 2021 | Prepend trainable continuous vectors to each Transformer layer's input | Less than 1% |
| [LoRA](/wiki/lora) | 2021 | Inject trainable low-rank decomposition matrices into attention layers | 0.01-0.1% (less than 1%) |
| Prompt tuning | 2021 | Train only continuous prompt embeddings prepended to input | Less than 0.1% |
| [QLoRA](/wiki/qlora) | 2023 | LoRA applied to quantized (4-bit) base models | 0.01-0.1% (less than 1%) |

LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, has become one of the most widely adopted PEFT methods [13]. By freezing the pre-trained model and adding small trainable low-rank matrices to the attention layers, LoRA, in the authors' words, "can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times" relative to fine-tuning GPT-3 175B with Adam, while matching or exceeding its quality [13]. QLoRA extends this approach by applying LoRA to models stored in 4-bit quantized format, enabling fine-tuning of 65-billion-parameter models on a single 48 GB GPU.
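
The low-rank update can be sketched for a single weight matrix. With a frozen matrix `W` of shape `d_out x d_in`, LoRA trains `A` (`r x d_in`) and `B` (`d_out x r`) and computes `W x + (alpha / r) * B (A x)`, with `B` initialized to zero so that training starts from the pre-trained behavior. The plain-Python helpers and the tiny matrices below are illustrative assumptions, and the parameter count shown is for one matrix only, not the whole-model reduction quoted above.

```python
def matvec(m: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(W, A, B, x, alpha: float, r: int) -> list[float]:
    """Frozen path W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """(parameters in the full matrix, trainable parameters in A and B)."""
    return d_out * d_in, r * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]      # frozen pre-trained weights
A = [[0.5, -0.5]]                 # r x d_in, trainable
B_init = [[0.0], [0.0]]           # d_out x r, zero at initialization
B_trained = [[1.0], [2.0]]        # made-up values after some training
x = [2.0, 1.0]

out_init = lora_forward(W, A, B_init, x, alpha=2.0, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0, r=1)
full, trainable = param_counts(4096, 4096, 8)
print(out_init, out_trained, full // trainable)
```

For a 4096 x 4096 matrix at rank 8, the trainable update has 256 times fewer parameters than the matrix it adapts.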

These methods change how transfer learning is practiced: rather than updating every parameter, practitioners train a small subset or add lightweight modules, leaving the general knowledge encoded in the frozen base model intact. A single large pre-trained model can then serve many tasks, each with its own small adapter, which drastically reduces storage and compute costs while keeping the benefits of transfer learning.

## Transfer learning in reinforcement learning

Transfer learning in reinforcement learning (RL) addresses the challenge of training agents in one environment and deploying them in another. This is particularly relevant in robotics, where training directly in the physical world is expensive, slow, and potentially dangerous.

### Sim-to-real transfer

Sim-to-real transfer involves training RL policies in simulated environments and then deploying them on physical robots [19]. The central challenge is the "reality gap," the mismatch between simulation dynamics and real-world physics, sensor noise, and environmental variability [19]. Key techniques for bridging this gap include:

- **Domain randomization.** Simulation parameters (lighting, textures, physics properties, sensor noise) are randomized across many training episodes so that the policy becomes robust to variation and can treat the real world as just another variation.
- **Domain adaptation.** The simulation is systematically adjusted to better match real-world conditions, or learned feature representations are aligned between simulated and real domains.
- **System identification.** Physical parameters of the real system are estimated and used to calibrate the simulator.
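
Domain randomization, the first technique above, reduces to sampling a fresh set of simulator parameters for every training episode. The sketch below shows only that sampling step. The parameter names and ranges in `PARAM_RANGES` are invented for illustration, and the simulator and policy update are assumed to exist elsewhere.

```python
import random

# Hypothetical simulator parameters and ranges, chosen purely for illustration.
PARAM_RANGES = {
    "light_intensity": (0.5, 1.5),
    "friction": (0.6, 1.2),
    "object_mass_kg": (0.05, 0.2),
    "sensor_noise_std": (0.0, 0.02),
}


def sample_sim_params(rng: random.Random) -> dict[str, float]:
    """Draw one randomized simulator configuration, uniform within each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


rng = random.Random(0)  # seeded so runs are reproducible
episodes = [sample_sim_params(rng) for _ in range(3)]
for params in episodes:
    # In a real pipeline: configure the simulator with `params`, roll out the
    # policy, and update it on the collected experience.
    print(params)
```

Wider ranges make the policy more robust but also make the learning problem harder, so the ranges themselves are a design choice.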

OpenAI demonstrated a notable example in 2019 when a robotic hand trained entirely in simulation learned to solve a Rubik's Cube in the real world, relying heavily on domain randomization to bridge the sim-to-real gap.

## Cross-lingual transfer

Cross-lingual transfer learning applies knowledge learned from one language (typically a high-resource language like English) to improve performance on tasks in other languages, especially low-resource ones with limited labeled data.

Multilingual pre-trained models like multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R) are trained on text from dozens or hundreds of languages simultaneously. These models develop shared cross-lingual representations that allow a model fine-tuned on English data alone to perform surprisingly well on the same task in other languages without any target-language labeled data.

XLM-R, trained on data from 100 languages using over two terabytes of filtered CommonCrawl text, significantly outperformed mBERT on cross-lingual benchmarks [17]. For example, it achieved 80% average accuracy on the XNLI natural language inference benchmark across 15 languages, despite being fine-tuned only on English training data [17].

A notable challenge in multilingual models is the "curse of multilinguality": with a fixed model capacity, adding more languages initially improves performance but eventually degrades it as languages compete for limited representational capacity. Scaling model size helps mitigate this effect.

## Practical guidelines

Choosing when and how to apply transfer learning depends on several factors. The following guidelines summarize best practices drawn from both research and industry experience.

**When transfer learning is most beneficial:**

- The target dataset is small (fewer than a few thousand labeled examples).
- A pre-trained model exists for a related domain or task.
- Training from scratch would require prohibitive data or compute resources.
- The source and target domains share meaningful structural similarities.

**When transfer learning may not help (or may hurt):**

- The source and target domains are fundamentally different with no shared structure.
- Abundant labeled data is available for the target task, making pre-training unnecessary.
- The pre-trained model was trained on data with different characteristics (for instance, transferring from natural images to medical images may require careful adaptation).

**Practical tips:**

- Start with feature extraction (frozen pre-trained model) as a baseline; move to fine-tuning only if performance is insufficient.
- Use a learning rate 10 to 100 times smaller than the original pre-training rate when fine-tuning.
- Monitor for overfitting, especially with small target datasets. Early stopping, dropout, and data augmentation are helpful countermeasures.
- Consider progressive unfreezing (ULMFiT style) to balance adaptation with retention of pre-trained knowledge.
- For very large models, use PEFT methods like LoRA rather than full fine-tuning.
- Validate that transfer is actually helping by comparing against a model trained from scratch on the target data.
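
Two of the tips above, the reduced learning rate and ULMFiT-style progressive unfreezing, can be expressed as a simple schedule. The sketch below unfreezes one additional layer per epoch, starting from the last layer. The layer names, the one-layer-per-epoch pace, and the default reduction factor of 10 are assumptions chosen for illustration.

```python
def gradual_unfreezing_schedule(layer_names: list[str],
                                num_epochs: int) -> list[list[str]]:
    """For each epoch, list the trainable layers, unfreezing from the top down."""
    schedule = []
    for epoch in range(num_epochs):
        num_trainable = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[-num_trainable:])
    return schedule


def fine_tune_lr(pretrain_lr: float, reduction: float = 10.0) -> float:
    """Learning rate for fine-tuning, reduced relative to pre-training."""
    return pretrain_lr / reduction


layers = ["embed", "block1", "block2", "head"]  # hypothetical model, input to output
schedule = gradual_unfreezing_schedule(layers, num_epochs=3)
for epoch, trainable in enumerate(schedule, start=1):
    print(epoch, trainable, fine_tune_lr(1e-3))
```

Everything not listed for a given epoch stays frozen, so the earliest, most general layers are the last to change.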

## Transfer learning approaches by domain

| Domain | Typical Source | Common Transfer Method | Example Application |
|---|---|---|---|
| [Image classification](/wiki/image_classification_models) | [ImageNet](/wiki/imagenet)-pre-trained CNN or ViT | Fine-tuning or feature extraction | Medical image diagnosis, satellite imagery |
| [Object detection](/wiki/object_detection) | COCO-pre-trained [YOLO](/wiki/yolo) or [DETR](/wiki/detr) | Fine-tuning detection head + backbone | Autonomous driving, industrial inspection |
| [Image segmentation](/wiki/image_segmentation) | ImageNet backbone + segmentation head | Fine-tuning encoder-decoder | Cell segmentation, land cover mapping |
| [Text classification](/wiki/sentiment_analysis) | Pre-trained [BERT](/wiki/bert) or [RoBERTa](/wiki/roberta) | Fine-tuning with classification head | Sentiment analysis, spam detection |
| [Question answering](/wiki/question_answering) | Pre-trained [BERT](/wiki/bert) / [T5](/wiki/t5) | Fine-tuning on QA dataset | Customer support, search |
| [Machine translation](/wiki/machine_translation) | Multilingual pre-trained model (mBERT, mT5) | Fine-tuning on parallel corpus | Low-resource language translation |
| Speech recognition | Pre-trained [Wav2Vec 2.0](/wiki/wav2vec) / [Whisper](/wiki/whisper) | Fine-tuning on target language | Transcription for under-resourced languages |
| [Reinforcement learning](/wiki/reinforcement_learning) | Policy pre-trained in simulation | Sim-to-real transfer | Robotics, game AI |
| Drug discovery | Molecular pre-trained model | Fine-tuning on bioactivity data | Predicting drug-target interactions |
| Code generation | Pre-trained code LLM ([Codex](/wiki/openai_codex), StarCoder) | Fine-tuning or prompting | Autocomplete, bug fixing |

## Applications across domains

Transfer learning has found applications across a wide range of fields beyond computer vision and NLP:

| Domain | Source task | Target task | Benefit |
|---|---|---|---|
| Medical imaging | ImageNet classification | Tumor detection, retinal disease screening | Reduced need for expensive expert-labeled medical data |
| Autonomous driving | Simulated environments | Real-world driving | Safer, cheaper development of driving policies |
| Speech recognition | Large multilingual speech data | Low-resource language ASR | Enables ASR for languages with limited recorded data |
| Drug discovery | Molecular property prediction on large datasets | Activity prediction for novel compounds | Accelerates screening of drug candidates |
| Satellite imagery | ImageNet or generic remote sensing data | Crop classification, disaster assessment | Leverages general visual features for specialized earth observation tasks |
| Recommendation systems | User behavior on one platform | Recommendations on a new platform | Cold-start problem mitigation |

Notable application areas in more detail:

- **[Computer vision](/wiki/computer_vision):** Pre-trained convolutional neural networks (CNNs) and [Vision Transformers](/wiki/vision_transformer) are employed to solve tasks like image classification, [object detection](/wiki/object_detection), and [semantic segmentation](/wiki/image_segmentation). Medical imaging benefits heavily from transfer learning because labeled medical data is scarce and expensive to annotate.
- **Natural language processing (NLP):** Transfer learning has enabled significant improvements in NLP tasks, including [sentiment analysis](/wiki/sentiment_analysis), [machine translation](/wiki/machine_translation), and text classification, using pre-trained models such as [BERT](/wiki/bert), [GPT](/wiki/gpt), and [T5](/wiki/t5). The entire modern ecosystem of [large language models](/wiki/large_language_model) is built on the principle of transfer learning.
- **[Reinforcement learning](/wiki/reinforcement_learning):** Transfer learning techniques speed up learning in RL tasks by reusing knowledge from previously learned tasks. Sim-to-real transfer, where policies are first trained in simulation and then transferred to physical robots, is a prominent example.
- **Speech and audio:** Models such as [Wav2Vec 2.0](/wiki/wav2vec) and [Whisper](/wiki/whisper) are pre-trained on large speech datasets and fine-tuned for automatic speech recognition, speaker identification, and emotion detection in low-resource languages.
- **Scientific domains:** Transfer learning has been applied to protein structure prediction, drug discovery, climate modeling, and materials science, where labeled data is often limited but related pre-training tasks (such as predicting molecular properties) provide useful knowledge.

## References

1. Pan, S.J. and Yang, Q. (2010). "A Survey on Transfer Learning." *IEEE Transactions on Knowledge and Data Engineering*, 22(10), 1345-1359.
2. Thrun, S. and Pratt, L. (1998). *Learning to Learn*. Springer. ISBN 978-0-7923-8047-7.
3. Caruana, R. (1997). "Multitask Learning." *Machine Learning*, 28(1), 41-75.
4. Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). "How Transferable Are Features in Deep Neural Networks?" *Advances in Neural Information Processing Systems*, 27, 3320-3328.
5. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014). "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." *Proceedings of the 31st International Conference on Machine Learning (ICML)*, 647-655.
6. Razavian, A.S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops*, 806-813.
7. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). "Deep Contextualized Word Representations." *Proceedings of NAACL-HLT*, 2227-2237.
8. Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)*, 328-339. arXiv:1801.06146.
9. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." *OpenAI Technical Report*.
10. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT*, 4171-4186. arXiv:1810.04805.
11. Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models Are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33, 1877-1901. arXiv:2005.14165.
12. Bommasani, R., Hudson, D.A., Adeli, E., et al. (2021). "On the Opportunities and Risks of Foundation Models." *Stanford CRFM Technical Report*. arXiv:2108.07258.
13. Hu, E.J., Shen, Y., Wallis, P., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." *Proceedings of the International Conference on Learning Representations (ICLR)*. arXiv:2106.09685.
14. Weiss, K., Khoshgoftaar, T.M., and Wang, D. (2016). "A Survey of Transfer Learning." *Journal of Big Data*, 3(1), 9.
15. Zhuang, F., Qi, Z., Duan, K., et al. (2021). "A Comprehensive Survey on Transfer Learning." *Proceedings of the IEEE*, 109(1), 43-76.
16. Ganin, Y., Ustinova, E., Ajakan, H., et al. (2016). "Domain-Adversarial Training of Neural Networks." *Journal of Machine Learning Research*, 17(59), 1-35.
17. Conneau, A., Khandelwal, K., Goyal, N., et al. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, 8440-8451.
18. Wang, Z., Dai, Z., Poczos, B., and Carbonell, J. (2019). "Characterizing and Avoiding Negative Transfer." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 11293-11302.
19. Zhao, W., Queralta, J.P., and Westerlund, T. (2020). "Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey." *arXiv preprint arXiv:2009.13303*.
20. Radford, A., Kim, J.W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML)*, 8748-8763. arXiv:2103.00020.

