See also: Machine learning terms
Transfer learning is a machine learning technique in which knowledge gained from training a model on one task or domain is reused to improve performance on a different but related task or domain. Rather than training a model from scratch for every new problem, transfer learning allows practitioners to leverage representations, parameters, or features learned from large datasets, reducing the amount of data, time, and computational resources required to achieve strong results. Transfer learning has become the dominant paradigm in modern deep learning, underpinning breakthroughs in computer vision, natural language processing, speech recognition, and many other fields.
Training a large neural network on a broad dataset and then adapting it to a specialized downstream task is now standard in both computer vision and natural language processing. The approach is especially valuable when labeled data for the target task is scarce or expensive to collect, because a pre-trained model already captures useful statistical regularities from the source data.
Imagine you already know how to ride a bicycle. When you try to learn how to ride a motorcycle, you do not have to start from zero because you already understand balance, steering, and braking. The skills you learned on the bicycle "transfer" to the motorcycle, making it much easier and faster to learn. Transfer learning works the same way for computers. A computer that has already learned to recognize thousands of objects in photos can use that knowledge to quickly learn a new task, like identifying specific bird species, without needing millions of new training examples.
The intellectual roots of transfer learning trace back to research on inductive transfer, inductive bias, and "learning to learn" in the early 1990s. In 1992, Lorien Pratt formulated the discriminability-based transfer (DBT) algorithm, one of the first explicit algorithms for transferring knowledge between neural networks trained on different tasks. Pratt published further early work on transfer between neural network tasks in 1993. The topic gained wider attention at the NIPS 1995 workshop titled "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems," held in Vail, Colorado and organized by Sebastian Thrun and others. This workshop brought together researchers interested in how a learner could exploit experience on previous tasks to improve performance on new ones.
Rich Caruana's 1997 paper on multi-task learning was another foundational contribution. Caruana framed MTL as "an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias," and demonstrated that training a neural network on multiple related tasks simultaneously, using shared hidden-layer representations, could improve generalization on each individual task. This work established the principle that shared representations capture useful inductive biases. The following year, Sebastian Thrun and Lorien Pratt edited the book Learning to Learn (1998), which collected foundational work on the subject and helped crystallize transfer learning as a recognized research direction within machine learning.
A pivotal moment in the field came with the publication of "A Survey on Transfer Learning" by Sinno Jialin Pan and Qiang Yang in IEEE Transactions on Knowledge and Data Engineering (2010). This survey provided a comprehensive taxonomy of transfer learning settings for classification, regression, and clustering, classifying transfer learning into inductive, transductive, and unsupervised categories. Pan and Yang formalized the distinction between different transfer learning scenarios based on differences in domains and tasks between source and target, and they discussed the relationship between transfer learning and related topics such as domain adaptation, multi-task learning, sample selection bias, and covariate shift. The paper became one of the most cited works in the field, with over 700 follow-up publications within a few years of its release, and it remains a standard reference for researchers entering the area.
The deep learning revolution that began around 2012, catalyzed by the success of AlexNet on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), made transfer learning practical at scale. Once researchers demonstrated that deep convolutional neural networks trained on ImageNet learned general-purpose visual features, the practice of pre-training on ImageNet and fine-tuning on downstream tasks quickly became the default approach in computer vision. Jason Yosinski and colleagues published an influential 2014 study quantifying how transferable different neural network layers are, finding that early layers learn general features (edges, textures) while later layers learn task-specific features.
In natural language processing, transfer learning took longer to mature. Word embedding methods like Word2Vec (2013) and GloVe (2014) represented an early form of transfer, providing pre-trained word representations. The real transformation came in 2018 with three landmark contributions: ELMo (Peters et al.), which introduced contextualized word embeddings; ULMFiT (Howard and Ruder), which demonstrated effective language-model fine-tuning for text classification; and BERT (Devlin et al.), which established pre-training and fine-tuning as the standard NLP paradigm. GPT (Radford et al., 2018) and its successors further demonstrated the power of large-scale language model pre-training.
Following the framework of Pan and Yang (2010), transfer learning can be defined more precisely using the concepts of domain and task.
A domain D consists of a feature space and a marginal probability distribution P(X) over it, where X = {x_1, x_2, ..., x_n} denotes a sample whose elements x_i are drawn from that feature space. A task T, given a domain D, consists of a label space Y and a predictive function f(·) learned from training data consisting of pairs {x_i, y_i}.
Given a source domain D_S with a corresponding source task T_S, and a target domain D_T with a corresponding target task T_T, transfer learning aims to improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where D_S is not equal to D_T or T_S is not equal to T_T. In other words, transfer learning applies whenever the source and target differ in their feature spaces, their marginal distributions, their label spaces, or their conditional distributions.
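A compact restatement of this definition in mathematical notation, as a sketch following Pan and Yang's formulation (calligraphic letters denote spaces):

```latex
% Domain: feature space plus a marginal distribution over samples
\mathcal{D} = \{\mathcal{X},\, P(X)\}, \qquad X = \{x_1, \dots, x_n\},\; x_i \in \mathcal{X}
% Task: label space plus a predictive function learned from pairs (x_i, y_i)
\mathcal{T} = \{\mathcal{Y},\, f(\cdot)\}
% Transfer learning: improve the target function f_T(\cdot) using source knowledge when
\mathcal{D}_S \neq \mathcal{D}_T \quad \text{or} \quad \mathcal{T}_S \neq \mathcal{T}_T
% i.e. when the feature spaces, the marginal distributions P(X), the label spaces,
% or the conditional distributions P(Y \mid X) differ between source and target.
```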
Pan and Yang's 2010 taxonomy divides transfer learning into three main categories based on the relationship between source and target domains and tasks.
| Type | Source labels | Target labels | Domain relationship | Task relationship | Typical methods |
|---|---|---|---|---|---|
| Inductive | Available or not | Required | Same or different | Different | Multi-task learning, self-taught learning, fine-tuning |
| Transductive | Available | Not available | Different | Same | Domain adaptation, sample selection bias correction |
| Unsupervised | Not available | Not available | Different | Different but related | Transfer clustering, transfer dimensionality reduction |
In inductive transfer learning, the target task differs from the source task regardless of whether the domains are the same or different. Some labeled data in the target domain is required, and the primary goal is to improve performance on the target task by exploiting knowledge gained from the source task. There are two sub-cases:
- Labeled data is available in the source domain. This setting resembles multi-task learning, except that inductive transfer optimizes only target-task performance rather than all tasks jointly.
- No labeled data is available in the source domain. This setting corresponds to self-taught learning, in which knowledge is extracted from unlabeled source data.
Modern fine-tuning of pre-trained model weights on a new task is the most common example of inductive transfer.
In transductive transfer learning, the source and target tasks are the same, but the domains differ. Labeled data exists only in the source domain. This setting encompasses domain adaptation (where marginal probability distributions differ between source and target) and cross-domain transfer. For example, a sentiment classifier trained on electronics product reviews might be adapted to work on book reviews or movie reviews, where the vocabulary and expression patterns are different but the underlying task is the same.
Unsupervised transfer learning addresses scenarios where no labeled data is available in either domain. The focus is on unsupervised tasks such as clustering or dimensionality reduction, and knowledge from a source domain's unlabeled data is transferred to help with an unsupervised learning task in a target domain. This approach often involves unsupervised feature extraction and unsupervised domain adaptation techniques.
Two primary strategies dominate practical transfer learning: feature extraction and fine-tuning. These strategies differ in how much of the pre-trained model is adapted to the new task.
In the feature extraction approach, the pre-trained model serves as a fixed feature extractor (a fixed encoder). The model's weights are frozen (not updated during training), and its intermediate or final layer outputs (activations) are used as input features for a new classifier or regression model trained on the target task. This approach works well when the target dataset is small and the source domain is sufficiently similar to the target domain. Because the pre-trained weights are never modified, there is minimal risk of overfitting to a small target dataset, and only the new classifier's parameters are updated during training. However, because the features are fixed, the model cannot adapt its internal representations to the specific nuances of the target task.
For example, a ResNet-50 model pre-trained on ImageNet can have its final classification layer removed, and the 2048-dimensional output of its penultimate layer can serve as a rich feature vector for a new image recognition task. A simple linear classifier trained on these features often achieves surprisingly strong performance.
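A minimal sketch of this feature-extraction recipe, assuming PyTorch, torchvision, and scikit-learn are available; the target images and labels are hypothetical placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

# Load a ResNet-50 pre-trained on ImageNet and drop its 1000-way classifier,
# keeping the 2048-dimensional penultimate representation.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()
backbone.eval()                      # also freezes batch-norm statistics
for p in backbone.parameters():
    p.requires_grad = False          # no gradients flow into the backbone

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Return frozen 2048-d feature vectors for a batch of preprocessed images."""
    return backbone(images)

# Train a simple linear classifier on the frozen features (hypothetical data):
# train_images: (N, 3, 224, 224) tensor normalized to ImageNet statistics,
# train_labels: length-N array of target class ids.
# features = extract_features(train_images).numpy()
# clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
```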
Fine-tuning involves initializing a model with pre-trained weights and then continuing training (with a typically smaller learning rate) on the target dataset. Depending on the amount of target data and the similarity between source and target domains, practitioners may fine-tune all layers of the model or only a subset. This allows the model to adapt its learned representations to the specific characteristics of the new task and domain. Fine-tuning generally outperforms feature extraction when sufficient target data is available and when the source and target domains are somewhat different. However, it carries a higher risk of overfitting, especially when the target dataset is very small. Careful learning rate scheduling, regularization, and strategies like progressive unfreezing help mitigate this risk.
A common strategy is to freeze the early layers of the network (which tend to learn general features like edges and textures) and only fine-tune the later layers (which learn more abstract, task-specific features). Progressive unfreezing, where layers are gradually unfrozen from top to bottom during training, is another effective technique that helps prevent catastrophic forgetting of previously learned representations.
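The following sketch illustrates partial fine-tuning with discriminative learning rates on a torchvision ResNet-50; the 10-class target task and the specific learning rates are illustrative assumptions, not prescriptions:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 10)   # new task-specific head

# Freeze the early, general-purpose layers (stem plus first two residual stages).
for name, param in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        param.requires_grad = False

# Discriminative learning rates: pre-trained later stages train with a rate
# roughly 10-100x smaller than the freshly initialized head.
optimizer = optim.AdamW([
    {"params": model.layer3.parameters(), "lr": 1e-5},
    {"params": model.layer4.parameters(), "lr": 3e-5},
    {"params": model.fc.parameters(),     "lr": 1e-3},
], weight_decay=1e-4)
# Training then proceeds as usual, e.g. with cross-entropy loss on the target data.
```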
A landmark study by Yosinski et al. (2014), "How Transferable Are Features in Deep Neural Networks?," systematically investigated the generality and specificity of features at different layers of a deep network. The authors found that early layers learn general features (such as edge detectors and color blobs) that are broadly useful across tasks, while later layers learn increasingly task-specific features. They also demonstrated that initializing a network with transferred features from almost any number of layers produces a boost to generalization, even after fine-tuning on the target dataset. This work provided foundational empirical evidence for the effectiveness of both feature extraction and fine-tuning strategies.
| Strategy | Weights Updated | Compute Cost | Best When | Risk |
|---|---|---|---|---|
| Feature extraction | Only new classifier head | Low | Small target dataset, similar domains | Under-fitting on dissimilar targets |
| Full fine-tuning | All layers | High | Large target dataset, dissimilar domains | Overfitting on small datasets |
| Partial fine-tuning | Top N layers + classifier | Medium | Moderate target data | Balances adaptation and stability |
| Progressive unfreezing | Layers unfrozen gradually | Medium-High | Medium target data, risk of catastrophic forgetting | Requires tuning of schedule |
| Scenario | Target data size | Domain similarity | Recommended strategy |
|---|---|---|---|
| Small dataset, similar domain | Small (hundreds) | High | Feature extraction |
| Small dataset, different domain | Small (hundreds) | Low | Feature extraction from early layers, or fine-tune with caution |
| Large dataset, similar domain | Large (thousands+) | High | Fine-tune all layers |
| Large dataset, different domain | Large (thousands+) | Low | Fine-tune all layers, possibly from scratch |
Computer vision was the first field where deep transfer learning became standard practice. The ImageNet dataset, containing over 1.2 million images across 1,000 categories (and over 14 million images across more than 20,000 categories in its broader form), served as the foundation for training models whose representations proved broadly useful.
After AlexNet won the ILSVRC 2012 competition, researchers quickly discovered that the features learned by deep CNNs on ImageNet were remarkably general and could be repurposed for tasks far removed from ImageNet classification. Two influential 2014 papers cemented this finding:
- DeCAF (Donahue et al., 2014) showed that activations from a CNN pre-trained on ImageNet serve as powerful generic features for a range of other recognition tasks.
- "CNN Features Off-the-Shelf" (Razavian et al., 2014) demonstrated that simple classifiers trained on off-the-shelf CNN features rival or surpass highly tuned task-specific pipelines across many benchmarks.
Today, pre-training on ImageNet (or larger datasets like ImageNet-21k, LAION-5B, or JFT-300M) remains the standard initialization strategy for vision models.
Backbone freezing is the practice of keeping the convolutional layers (the "backbone") of a pre-trained model fixed and training only a new task-specific head (for example, a fully connected classification layer). The rationale comes from the hierarchical nature of feature learning in CNNs: early layers capture universal visual features like edges and textures, while later layers develop task-specific representations.
Freezing the backbone preserves these general-purpose features, prevents overfitting on small target datasets, and can reduce GPU memory consumption by up to 28% compared to full fine-tuning. It is especially effective when the target dataset is small and visually similar to the source domain.
Progressive unfreezing is a training strategy in which layers of a pre-trained model are gradually unfrozen from the top (closest to the output) to the bottom (closest to the input) over the course of training. The approach was popularized by Howard and Ruder (2018) in the context of NLP (see ULMFiT below), but it is equally applicable to vision models.
The procedure typically follows these steps:
1. Replace the original output layer with a new task-specific head and train only that head while all pre-trained layers remain frozen.
2. Unfreeze the top layer or block of the pre-trained model and continue training with a reduced learning rate.
3. Repeat, unfreezing the next-deepest layer or block at regular intervals (for example, once per epoch or when the validation loss plateaus).
4. Optionally finish with a brief period in which the entire network is trainable at a very low learning rate.
Progressive unfreezing helps prevent catastrophic forgetting (the phenomenon where fine-tuning destroys previously learned features) and yields minimal accuracy loss (often less than 1%) compared to full fine-tuning while providing substantial reductions in compute, memory, and training time.
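A minimal sketch of such an unfreezing schedule for a ResNet-style model; the 10-class head, the per-phase learning rates, and the `train_one_epoch` helper are all hypothetical assumptions:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Hypothetical setup: ImageNet-pre-trained ResNet-50 with a new 10-class head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 10)

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Phase 0: only the new head trains; later phases unfreeze stages top-down.
set_trainable(model, False)
set_trainable(model.fc, True)

unfreeze_order = [model.layer4, model.layer3, model.layer2, model.layer1]
epochs_per_phase = 2

for phase, stage in enumerate([None] + unfreeze_order):
    if stage is not None:
        set_trainable(stage, True)                 # expose one deeper stage per phase
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.AdamW(trainable_params, lr=1e-4 / (phase + 1))
    for _ in range(epochs_per_phase):
        train_one_epoch(model, optimizer)          # hypothetical training-loop helper
```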
Several architectures trained on ImageNet became standard starting points for transfer learning:
| Architecture | Year | Pre-Training Data | Key Feature / Innovation | Parameters (approx.) | Typical Transfer Use |
|---|---|---|---|---|---|
| AlexNet | 2012 | ImageNet | First large-scale deep CNN | ~60M | Historical; rarely used today |
| VGGNet (VGG-16/19) | 2014 | ImageNet | Deep stacking of 3x3 convolutions; simple, uniform architecture | 138M | Feature extraction baseline |
| GoogLeNet / Inception | 2014 | ImageNet | Inception modules with parallel convolution paths | 6.8M | Efficient feature extraction |
| ResNet (50/101/152) | 2015 | ImageNet | Residual connections enabling very deep networks | 25M-60M | Widely used backbone |
| DenseNet | 2017 | ImageNet | Dense connectivity between layers | 8M-20M | Strong classifier backbone |
| EfficientNet | 2019 | ImageNet | Compound scaling of depth, width, resolution | 5M-66M | High accuracy, efficient |
| Vision Transformer (ViT) | 2020 | ImageNet / JFT-300M | Transformer applied to image patches | 86M-632M | Strong with large data |
| Swin Transformer | 2021 | ImageNet | Shifted window attention | Varies | State-of-the-art backbone |
| ConvNeXt | 2022 | ImageNet | Modernized ConvNet | Varies | Competitive with transformers |
Yosinski et al. (2014) showed that the first layers of these networks learn general visual features (Gabor-like filters, color blobs) that transfer well across tasks, while higher layers become progressively more task-specific. This finding provided theoretical justification for the practice of freezing early layers while fine-tuning later ones.
A typical transfer learning workflow in computer vision involves: (1) selecting a pre-trained model appropriate for the task complexity and available compute; (2) replacing the final classification head with a new head matching the number of target classes; (3) optionally freezing early layers; (4) training on the target dataset with a reduced learning rate (often 10 to 100 times smaller than the original training rate); and (5) evaluating and optionally unfreezing additional layers for further fine-tuning.
Smaller models like MobileNet often outperform larger architectures such as ResNet-152 in low-data regimes because they are less prone to overfitting.
Transfer learning transformed NLP beginning in 2018, a year sometimes called the "ImageNet moment" for language. The field shifted from task-specific architectures to a general "pre-train then fine-tune" paradigm.
Before the era of pre-trained language models, word embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) represented a rudimentary form of transfer learning. These embeddings were trained on large unlabeled corpora and then used as input features for downstream NLP models. However, word embeddings are static (each word receives a single vector regardless of context), which limits their transferability.
ELMo (Embeddings from Language Models), introduced by Peters et al. (2018), represented a major step forward. ELMo produces deep contextualized word representations by running a deep bidirectional LSTM language model on text and combining the internal states of all layers. Unlike static word embeddings, ELMo representations change depending on the surrounding context, capturing polysemy and syntactic roles. Adding ELMo representations to existing NLP models improved the state of the art across six challenging tasks, including question answering, textual entailment, and sentiment analysis.
Universal Language Model Fine-tuning (ULMFiT), proposed by Jeremy Howard and Sebastian Ruder (2018), was one of the first methods to demonstrate that a single general-purpose pre-trained language model could be effectively fine-tuned for any text classification task. ULMFiT introduced three key techniques that became standard practice:
- Discriminative fine-tuning: each layer is fine-tuned with its own learning rate, with lower rates for earlier, more general layers.
- Slanted triangular learning rates: the learning rate is first increased briefly and then decayed slowly, letting the model settle into a good region of parameter space before refining.
- Gradual unfreezing: layers are unfrozen one at a time, starting from the last layer, to avoid catastrophic forgetting.
These techniques allowed ULMFiT to reduce classification error by 18-24% on the majority of benchmark datasets and achieve state-of-the-art results on six text classification datasets, often with only 100 labeled examples, matching performance that previously required 10,000 or more labeled samples.
GPT (Generative Pre-trained Transformer), introduced by Radford et al. at OpenAI, applied the pre-train-then-fine-tune paradigm to the Transformer architecture. GPT was pre-trained as a unidirectional (left-to-right) language model on the BooksCorpus dataset and then fine-tuned on downstream tasks with minimal architectural changes. GPT demonstrated that Transformer-based language models, when pre-trained at sufficient scale, could transfer effectively to tasks including textual entailment, question answering, and semantic similarity.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google, extended the pre-train-then-fine-tune paradigm by using a masked language model objective that considers both left and right context simultaneously. Pre-trained on large text corpora using masked language modeling and next sentence prediction objectives, BERT's bidirectional representations captured rich contextual information. Fine-tuning BERT on a target task typically required only adding a task-specific output layer and training for a few epochs. BERT achieved state-of-the-art results on eleven NLP benchmarks upon its release and popularized the pattern of releasing pre-trained models that the community could fine-tune for specific tasks.
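The pattern BERT popularized is now a few lines of code with the Hugging Face transformers library. The sketch below fine-tunes a BERT classification head on a sentiment task; the dataset choice and hyperparameters are illustrative assumptions:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A new, randomly initialized classification head is placed on top of the
# pre-trained encoder; a few epochs of fine-tuning are typically enough.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

dataset = load_dataset("imdb")                      # example sentiment dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-imdb", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```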
BERT's success spawned numerous variants (such as RoBERTa, ALBERT, DistilBERT, ELECTRA, and DeBERTa) that improved upon the original through changes to training procedures, model architecture, or efficiency.
Subsequent work by OpenAI scaled up the GPT architecture. GPT-2 (2019) demonstrated that increasing model size and training data led to emergent capabilities in zero-shot task performance. GPT-3 (2020), with 175 billion parameters, showed that sufficiently large language models could perform tasks through in-context learning (providing task examples in the prompt) without any gradient-based fine-tuning. This represented a new form of transfer where the model's pre-trained knowledge was activated through prompting rather than parameter updates.
| Model | Year | Architecture | Pre-Training Objective | Transfer Method / Key Contribution |
|---|---|---|---|---|
| Word2Vec | 2013 | Shallow network | Skip-gram / CBOW | Static embeddings as input features |
| GloVe | 2014 | Matrix factorization | Co-occurrence statistics | Static embeddings as input features |
| ELMo | 2018 | Bidirectional LSTM | Language modeling | Contextualized embeddings added to task model |
| ULMFiT | 2018 | AWD-LSTM | Language modeling | Full model fine-tuning with gradual unfreezing |
| GPT | 2018 | Transformer (decoder) | Autoregressive LM | Fine-tuning with task-specific head |
| BERT | 2018 | Transformer (encoder) | Masked LM + next sentence prediction | Dominant pre-training approach for NLU tasks |
| GPT-2 / GPT-3 | 2019-2020 | Transformer (decoder) | Autoregressive LM | Emergent few-shot, zero-shot, in-context learning |
| T5 | 2019-2020 | Transformer (encoder-decoder) | Text-to-text denoising | Unified diverse NLP tasks under a single format |
Domain adaptation is a specific form of transductive transfer learning that addresses the problem of training and test data coming from different distributions while sharing the same task. This is one of the most practically important transfer learning scenarios, as real-world deployment conditions frequently differ from training conditions. Common examples include adapting a model trained on synthetic images to work on real photographs, or adapting an NLP model trained on news text to process social media posts.
Several families of techniques address domain adaptation:
| Technique | Description | Example Methods |
|---|---|---|
| Instance reweighting | Assigns different weights to source samples to approximate the target distribution | Importance weighting, Kernel Mean Matching |
| Feature alignment | Maps source and target features into a shared representation space | Maximum Mean Discrepancy (MMD), Correlation Alignment (CORAL) |
| Adversarial adaptation | Uses a domain discriminator to encourage domain-invariant features | Domain-Adversarial Neural Networks (DANN), Adversarial Discriminative Domain Adaptation (ADDA) |
| Batch normalization adaptation | Modulates batch normalization statistics from source to target domain | Adaptive Batch Normalization (AdaBN) |
| Self-training / pseudo-labeling | Uses model predictions on unlabeled target data as pseudo-labels for further training | Noisy Student Training |
Adversarial approaches, inspired by generative adversarial networks, train a domain discriminator to distinguish source from target features while the feature extractor is trained to fool the discriminator. This encourages domain-invariant representations. Domain-Adversarial Neural Networks (DANN), introduced by Ganin et al. (2016), are a representative example.
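The core trick in DANN is a gradient reversal layer. The sketch below shows one common way to implement it in PyTorch; the feature extractor and heads are small placeholders, and real models would be task-specific:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

feature_extractor = nn.Sequential(nn.Linear(256, 128), nn.ReLU())   # placeholder
label_classifier = nn.Linear(128, 10)        # predicts task labels (source data only)
domain_discriminator = nn.Linear(128, 2)     # predicts source vs. target domain

def forward(x, lambd=1.0):
    features = feature_extractor(x)
    class_logits = label_classifier(features)
    # The reversed gradient trains the feature extractor to *fool* the
    # discriminator, which encourages domain-invariant features.
    domain_logits = domain_discriminator(grad_reverse(features, lambd))
    return class_logits, domain_logits
```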
Few-shot learning and zero-shot learning represent extreme forms of transfer, where models must generalize to new tasks or classes with minimal or no task-specific training examples.
In zero-shot learning, the model must generalize to classes or tasks it has never seen during training. This is made possible by transferring knowledge through shared semantic representations, attributes, or natural language descriptions. For example, a vision model trained to recognize horses and stripes can recognize a zebra by combining these concepts, even if it has never seen a zebra during training. OpenAI's CLIP model demonstrated that a model trained on image-text pairs could classify images into arbitrary categories described in natural language without any task-specific training data. Similarly, GPT-3 showed that sufficiently large language models can perform new tasks given only a natural language description of the task, a capability described as "zero-shot task transfer"; for large language models, zero-shot learning thus takes the form of prompting, with the model's pre-trained knowledge directed at a task described in natural language rather than learned from task-specific examples.
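As a concrete illustration, zero-shot image classification with CLIP can be sketched as follows using the Hugging Face transformers library; the image path and candidate labels are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bird.jpg")                       # hypothetical input image
labels = ["a photo of a zebra", "a photo of a horse", "a photo of a bird"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image compares the image embedding against each candidate text
# embedding; softmax turns the similarities into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```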
Few-shot learning involves adapting to new tasks using only a handful of labeled examples (typically 1 to 10 per class). Common approaches include:
- Metric-based methods (e.g., siamese or prototypical networks), which learn an embedding space in which new classes can be recognized by their distance to the few labeled examples (a minimal sketch appears below).
- Optimization-based meta-learning (e.g., MAML), which learns an initialization that can be adapted to a new task with only a few gradient steps.
- Fine-tuning a large pre-trained model on the few available examples, usually with strong regularization or parameter-efficient methods to limit overfitting.
- In-context learning with large language models, where the examples are supplied directly in the prompt and no parameters are updated.
These capabilities arise because large-scale pre-training on diverse data allows models to develop general-purpose representations that capture broad knowledge about the world, which can then be directed toward specific tasks through prompts, instructions, or minimal examples.
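To make the metric-based idea concrete, the toy sketch below classifies queries in the style of prototypical networks: class prototypes are mean embeddings of the few labeled support examples, and each query is assigned to the nearest prototype. The `embed` function stands in for any pre-trained encoder (for example, a frozen CNN backbone):

```python
import torch

def prototypical_predict(embed, support_x, support_y, query_x, n_classes):
    support_emb = embed(support_x)                      # (n_support, d)
    query_emb = embed(query_x)                          # (n_query, d)
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack([
        support_emb[support_y == c].mean(dim=0) for c in range(n_classes)
    ])                                                  # (n_classes, d)
    # Assign each query to the class with the nearest prototype.
    dists = torch.cdist(query_emb, prototypes)          # (n_query, n_classes)
    return dists.argmin(dim=1)
```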
Transfer learning does not always improve performance. When the source and target are sufficiently dissimilar, transferring knowledge from the source can actually degrade performance on the target task. This phenomenon is known as negative transfer. Understanding and avoiding negative transfer is a critical practical concern.
| Factor | Description | Example |
|---|---|---|
| Domain divergence | Source and target distributions are too dissimilar | Medical X-rays vs. satellite imagery |
| Task conflict | Source task objectives conflict with target task | Sentiment analysis model transferred to topic classification |
| Feature misalignment | Shared features have different meanings across domains | "Bank" meaning financial institution vs. river bank |
| Over-transfer | Too many source parameters are rigidly transferred | Freezing too many layers when domains differ substantially |
| Irrelevant source data | Source data contains patterns specific to the source that are irrelevant to the target | Model relies on spurious patterns and performs worse than training from scratch |
| Insufficient target data for adaptation | With very little target data, fine-tuning may overfit to noise | Tiny labeled target set cannot meaningfully adapt transferred features |
| Mismatched model capacity | Transferring from an overly complex model to a simple target (or vice versa) introduces optimization difficulties | Huge pre-trained backbone applied to a tiny niche task |
Researchers have developed several strategies to detect and mitigate negative transfer:
- Estimating source-target relatedness before transferring, for example by comparing feature distributions or running small pilot experiments, and skipping transfer when the domains appear too dissimilar.
- Comparing the transferred model against a from-scratch baseline on the target task, since negative transfer is only visible relative to such a baseline.
- Selecting or reweighting the most relevant source data, tasks, or layers rather than transferring everything indiscriminately.
- Using more flexible adaptation (partial fine-tuning, lower learning rates, regularization) so that harmful source knowledge can be overwritten by target data.
Wang et al. (2019) provided a formal characterization of negative transfer in the context of computer vision and proposed methods for avoiding it through careful source-target task relationship modeling.
As a general guideline, transfer learning is most likely to help when the source and target share similar low-level features, when the target dataset is small relative to model capacity, and when the source model was trained on a large and diverse dataset.
Multi-task learning (MTL) is closely related to transfer learning but differs in its simultaneous training objective. While transfer learning typically involves sequential stages (pre-train on source, then adapt to target), multi-task learning trains a model on multiple tasks at the same time using shared representations.
Caruana (1997) described multi-task learning as "an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias." In a multi-task setup, related tasks share hidden layers in a neural network, and what is learned for one task can help other tasks be learned more effectively.
The key differences between MTL and sequential transfer learning are:
- Timing: MTL trains all tasks simultaneously, whereas sequential transfer learning proceeds in stages (pre-train on the source, then adapt to the target).
- Objective: MTL aims to improve performance on all tasks jointly, while transfer learning cares primarily about the target task and may ignore or sacrifice source-task performance.
- Data access: MTL requires data for every task to be available during training; sequential transfer needs the source data only during pre-training, and often only a pre-trained checkpoint is retained.
Despite these differences, the two approaches share the core principle that learning from related tasks provides useful inductive bias. In practice, many modern systems blend both: a model might be pre-trained with multiple auxiliary tasks (MTL) and then fine-tuned on a specific target task (sequential transfer). For example, T5 is pre-trained with a multi-task mixture of unsupervised and supervised objectives and then fine-tuned on individual downstream tasks.
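A minimal sketch of hard parameter sharing, the simplest MTL setup: two task heads sit on one shared trunk, and both task losses update the shared weights. Dimensions and tasks are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 5)    # e.g. a 5-class classification task
        self.head_b = nn.Linear(hidden, 1)    # e.g. a related regression task

    def forward(self, x):
        h = self.shared(x)                    # representation shared across tasks
        return self.head_a(h), self.head_b(h)

model = MultiTaskModel()
x = torch.randn(32, 128)
y_a, y_b = torch.randint(0, 5, (32,)), torch.randn(32, 1)
logits_a, pred_b = model(x)
# A weighted sum of per-task losses trains the shared trunk on both signals.
loss = nn.CrossEntropyLoss()(logits_a, y_a) + 0.5 * nn.MSELoss()(pred_b, y_b)
loss.backward()
```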
The concept of foundation models, popularized by the Stanford Center for Research on Foundation Models (CRFM) in 2021, represents transfer learning taken to its logical extreme. A foundation model is a large model pre-trained on broad data at scale that can be adapted to a wide range of downstream tasks. Examples include GPT-4, Claude, Gemini, PaLM, LLaMA, CLIP, DALL-E, and Stable Diffusion.
Foundation models differ from earlier transfer learning approaches in several ways:
- Scale: they are trained on far larger datasets and parameter counts than earlier pre-trained models.
- Generality: a single model serves as the starting point for a very wide range of tasks and, increasingly, multiple modalities.
- Adaptation methods: adaptation increasingly happens through prompting, in-context learning, or parameter-efficient fine-tuning rather than full fine-tuning.
- Emergence and homogenization: new capabilities can appear with scale, and many downstream applications come to depend on (and inherit the flaws of) a small number of shared base models.
The total number of research papers related to foundation models has grown from fewer than 500 publications in 2020 to over 9,000 by 2025, reflecting the centrality of this scaled-up transfer learning paradigm in contemporary AI research.
As models have grown to billions or even trillions of parameters, full fine-tuning has become increasingly impractical. Retraining all parameters of a 175-billion-parameter model for each downstream task requires prohibitive memory and compute resources. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small fraction of the model's parameters while keeping the rest frozen.
| Method | Year | Approach | Trainable params (typical) |
|---|---|---|---|
| Adapter modules | 2019 | Insert small trainable feed-forward bottleneck modules between frozen Transformer layers | 1-5% of total (often 2-4%) |
| Prefix tuning | 2021 | Prepend trainable continuous vectors to each Transformer layer's input | Less than 1% |
| LoRA | 2021 | Inject trainable low-rank decomposition matrices into attention layers | 0.01-0.1% (less than 1%) |
| Prompt tuning | 2021 | Train only continuous prompt embeddings prepended to input | Less than 0.1% |
| QLoRA | 2023 | LoRA applied to quantized (4-bit) base models | 0.01-0.1% (less than 1%) |
LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, has become one of the most widely adopted PEFT methods. By freezing the pre-trained model and adding small trainable low-rank matrices to the attention layers, LoRA reduces the number of trainable parameters by up to 10,000 times compared to full fine-tuning while matching or exceeding its performance. QLoRA extends this approach by applying LoRA to models stored in 4-bit quantized format, enabling fine-tuning of 65-billion-parameter models on a single 48 GB GPU.
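The idea can be sketched from scratch as a LoRA-augmented linear layer: the pre-trained weight is frozen and only the low-rank update is trained. This mirrors the construction in Hu et al. (2021); in practice one would typically use a library such as PEFT rather than hand-rolled modules:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # freeze the pre-trained weights
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example: wrap a 768x768 projection; only the low-rank factors train.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288 trainable parameters vs ~590k frozen ones
```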
These methods represent a shift in how transfer learning is practiced: rather than updating all of a model's parameters, practitioners modify only a small fraction of parameters or add lightweight modules, preserving the general knowledge encoded in the frozen base model while adapting behavior for specific tasks. This makes it practical to maintain a single large pre-trained model and create lightweight, task-specific adapters, preserving the benefits of transfer learning while drastically reducing storage and compute costs.
Transfer learning in reinforcement learning (RL) addresses the challenge of training agents in one environment and deploying them in another. This is particularly relevant in robotics, where training directly in the physical world is expensive, slow, and potentially dangerous.
Sim-to-real transfer involves training RL policies in simulated environments and then deploying them on physical robots. The central challenge is the "reality gap," the mismatch between simulation dynamics and real-world physics, sensor noise, and environmental variability. Key techniques for bridging this gap include:
- Domain randomization: randomizing simulator parameters (textures, lighting, physics constants) during training so that the real world appears to the policy as just another variation (sketched below).
- System identification: calibrating the simulator's dynamics to better match the target robot and environment.
- Domain adaptation on observations: aligning or translating simulated and real sensor inputs so the policy receives similar representations in both settings.
- Real-world fine-tuning: continuing training on a limited amount of real experience after pre-training in simulation.
OpenAI demonstrated a notable example in 2019 when a robotic hand trained entirely in simulation learned to solve a Rubik's Cube in the real world, relying heavily on domain randomization to bridge the sim-to-real gap.
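A hedged sketch of domain randomization: physics and visual parameters are re-sampled every episode so the learned policy cannot overfit to any single simulator configuration. The simulator API (`env.set_params`) and the parameter ranges shown here are hypothetical:

```python
import random

def randomize_domain(env):
    """Resample simulator parameters at the start of each training episode."""
    env.set_params(
        friction=random.uniform(0.5, 1.5),        # vary contact dynamics
        object_mass=random.uniform(0.8, 1.2),     # vary payload mass
        motor_gain=random.uniform(0.9, 1.1),      # vary actuation strength
        light_intensity=random.uniform(0.3, 1.0), # vary rendering for vision policies
        sensor_noise_std=random.uniform(0.0, 0.02),
    )

# Hypothetical training loop:
# for episode in range(num_episodes):
#     randomize_domain(env)                  # env: hypothetical simulator instance
#     rollout = collect_rollout(env, policy)
#     policy.update(rollout)
```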
Cross-lingual transfer learning applies knowledge learned from one language (typically a high-resource language like English) to improve performance on tasks in other languages, especially low-resource ones with limited labeled data.
Multilingual pre-trained models like multilingual BERT (mBERT) and XLM-RoBERTa (XLM-R) are trained on text from dozens or hundreds of languages simultaneously. These models develop shared cross-lingual representations that allow a model fine-tuned on English data alone to perform surprisingly well on the same task in other languages without any target-language labeled data.
XLM-R, trained on data from 100 languages using over two terabytes of filtered CommonCrawl text, significantly outperformed mBERT on cross-lingual benchmarks. For example, it achieved 80% average accuracy on the XNLI natural language inference benchmark across 15 languages, despite being fine-tuned only on English training data.
A notable challenge in multilingual models is the "curse of multilinguality": with a fixed model capacity, adding more languages initially improves performance but eventually degrades it as languages compete for limited representational capacity. Scaling model size helps mitigate this effect.
Choosing when and how to apply transfer learning depends on several factors. The following guidelines summarize best practices drawn from both research and industry experience.
When transfer learning is most beneficial:
- Labeled data for the target task is scarce or expensive to collect.
- The source and target domains share low-level structure (for example, natural images or general-domain text).
- The source model was pre-trained on a large, diverse dataset.
- Training a comparable model from scratch would exceed the available compute or time budget.

When transfer learning may not help (or may hurt):
- The source and target domains are highly dissimilar, raising the risk of negative transfer.
- Abundant labeled target data is available and training from scratch is feasible.
- The source model relies on spurious or source-specific patterns that conflict with the target task.

Practical tips:
- Start with feature extraction as a cheap baseline, then move to partial or full fine-tuning if more target data and compute are available.
- Fine-tune with a learning rate roughly 10 to 100 times smaller than the rate used for the original pre-training.
- Freeze early layers first and unfreeze progressively if performance plateaus.
- Always compare against a from-scratch baseline to detect negative transfer.
| Domain | Typical Source | Common Transfer Method | Example Application |
|---|---|---|---|
| Image classification | ImageNet-pre-trained CNN or ViT | Fine-tuning or feature extraction | Medical image diagnosis, satellite imagery |
| Object detection | COCO-pre-trained YOLO or DETR | Fine-tuning detection head + backbone | Autonomous driving, industrial inspection |
| Image segmentation | ImageNet backbone + segmentation head | Fine-tuning encoder-decoder | Cell segmentation, land cover mapping |
| Text classification | Pre-trained BERT or RoBERTa | Fine-tuning with classification head | Sentiment analysis, spam detection |
| Question answering | Pre-trained BERT / T5 | Fine-tuning on QA dataset | Customer support, search |
| Machine translation | Multilingual pre-trained model (mBERT, mT5) | Fine-tuning on parallel corpus | Low-resource language translation |
| Speech recognition | Pre-trained Wav2Vec 2.0 / Whisper | Fine-tuning on target language | Transcription for under-resourced languages |
| Reinforcement learning | Policy pre-trained in simulation | Sim-to-real transfer | Robotics, game AI |
| Drug discovery | Molecular pre-trained model | Fine-tuning on bioactivity data | Predicting drug-target interactions |
| Code generation | Pre-trained code LLM (Codex, StarCoder) | Fine-tuning or prompting | Autocomplete, bug fixing |
Transfer learning has found applications across a wide range of fields beyond computer vision and NLP:
| Domain | Source task | Target task | Benefit |
|---|---|---|---|
| Medical imaging | ImageNet classification | Tumor detection, retinal disease screening | Reduced need for expensive expert-labeled medical data |
| Autonomous driving | Simulated environments | Real-world driving | Safer, cheaper development of driving policies |
| Speech recognition | Large multilingual speech data | Low-resource language ASR | Enables ASR for languages with limited recorded data |
| Drug discovery | Molecular property prediction on large datasets | Activity prediction for novel compounds | Accelerates screening of drug candidates |
| Satellite imagery | ImageNet or generic remote sensing data | Crop classification, disaster assessment | Leverages general visual features for specialized earth observation tasks |
| Recommendation systems | User behavior on one platform | Recommendations on a new platform | Cold-start problem mitigation |
Notable application areas in more detail:
- Medical imaging: models pre-trained on ImageNet are fine-tuned for tasks such as tumor detection and retinal disease screening, reducing the need for expensive expert-labeled scans.
- Speech recognition: pre-trained acoustic models such as Wav2Vec 2.0 and Whisper are fine-tuned to build recognizers for languages with little transcribed audio.
- Autonomous driving: perception models and driving policies trained in simulation are transferred to real vehicles, lowering the cost and risk of development.
- Drug discovery: molecular models pre-trained on large property-prediction datasets are fine-tuned to predict the activity of novel compounds, accelerating candidate screening.
- Recommendation systems: knowledge of user behavior learned on one platform helps mitigate the cold-start problem on a new platform.