Domain adaptation
Last reviewed
Sources
42 citations
Review status
Source-backed
Revision
v2 · 4,622 words
Domain adaptation is the subfield of transfer learning that adapts a model trained on a labelled source domain so it performs well on a related but different target domain, where labels are scarce or absent, despite a shift in the input distribution. It is the case of transfer learning in which the task stays the same (the same label space and the same prediction goal) but the source and target inputs are drawn from different distributions, a setting known as covariate or domain shift. Standard supervised machine learning assumes that training and test data are drawn independently and identically (i.i.d.) from the same distribution; domain adaptation relaxes this i.i.d. assumption and tries to characterise, bound, and reduce the loss in target performance that follows.
The field has been shaped by a foundational generalisation bound from Ben-David et al. (2010), which proved that the target error of a source-trained classifier is governed by the source error plus a distribution-divergence term that is estimable from unlabelled data alone [1]. A wave of deep adversarial and discrepancy-based methods followed between 2015 and 2018, and more recently the focus has shifted toward source-free, test-time, and foundation-model-era robustness.
Real-world deployment rarely matches the training distribution. A vision model trained on stock photography may be evaluated on smartphone images. A medical imaging model trained at one hospital may be deployed at another with a different scanner brand and patient demographic. A driving model trained in California will not see the same lighting or signage in Berlin. Failure to adapt produces silent accuracy drops that are hard to detect post-deployment.
This is not a theoretical concern. Recht, Roelofs, Schmidt, and Shankar (2019) carefully reproduced the ImageNet test-set construction process and found that a wide range of state-of-the-art classifiers lost 11 to 14 percent accuracy on the new test set, even though the new images were drawn from the same Flickr pool by the same protocol; the companion CIFAR-10 reconstruction produced drops of 3 to 15 percent [2]. The minutiae of dataset collection can produce a measurable distribution shift that hurts every model.
Foundation-model fine-tuning is implicitly a domain adaptation problem: a pre-trained model encodes a broad source distribution and is adapted to a narrower target task with much less data, often using parameter-efficient methods such as LoRA.
Domain adaptation literature distinguishes several flavours of shift between $P_S(X, Y)$ and $P_T(X, Y)$, each with different identifiability and different algorithmic remedies.
| Shift type | Decomposition | What changes | Canonical example |
|---|---|---|---|
| Covariate shift | $P(Y \mid X)$ preserved, $P(X)$ differs | Inputs are sampled differently | Different camera or sensor; different population sampling |
| Label / prior shift | $P(X \mid Y)$ preserved, $P(Y)$ differs | Class frequencies change | Disease prevalence differs between hospitals |
| Concept shift / drift | $P(Y \mid X)$ differs | Labelling rule changes | User preferences shift over time; sentiment classes redefined |
| Subpopulation shift | Target is a sub-distribution of source | Mass concentrates on a slice | Deploy a model trained on all users only on a minority group |
| Domain shift | General; both marginals and conditionals may move | Different data-generating processes | Photograph vs sketch; English vs Spanish text |
Covariate shift is the most theoretically tractable case because the target risk can be estimated without bias from labelled source data by importance weighting, provided the density ratio $P_T(x) / P_S(x)$ is known (or can be estimated) and the target support is contained in the source support. Concept shift is the hardest: without target labels there is no information about how $P(Y \mid X)$ has moved. Deep-learning DA usually addresses domain shift, where multiple components change simultaneously.
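The importance-weighting argument can be checked numerically. The following is a minimal sketch, assuming one-dimensional Gaussian input densities that are known exactly (in practice the ratio has to be estimated, for example with KMM or KLIEP). The distributions, the labelling rule, and the classifier are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

# Hypothetical covariate shift: source inputs ~ N(0, 1), target inputs ~ N(1, 1).
x_src = rng.normal(0.0, 1.0, size=200_000)
x_tgt = rng.normal(1.0, 1.0, size=200_000)

def true_label(x):
    # The labelling rule P(Y | X) is identical in both domains.
    return (x > 0.5).astype(float)

def predict(x):
    # A fixed, imperfect classifier whose target risk we want to estimate.
    return (x > 0.0).astype(float)

# 0-1 loss on labelled source samples.
loss_src = (predict(x_src) != true_label(x_src)).astype(float)

# Density ratio P_T(x) / P_S(x), assumed known here.
weights = gaussian_pdf(x_src, 1.0, 1.0) / gaussian_pdf(x_src, 0.0, 1.0)

naive_estimate = loss_src.mean()                # estimates the source risk
weighted_estimate = (weights * loss_src).mean() # estimates the target risk
true_target_risk = (predict(x_tgt) != true_label(x_tgt)).mean()

print(naive_estimate, weighted_estimate, true_target_risk)
```

The weighted estimate uses no target labels; the labelled target sample appears only to check the result. When the density ratio is large in some region, the weights have high variance, which is one reason practical estimators clip or regularise them.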
The theoretical core is the generalisation bound of Ben-David et al., introduced in Analysis of Representations for Domain Adaptation (NeurIPS 2007) and extended in A theory of learning from different domains (Machine Learning, 2010, vol. 79, pp. 151-175) [1][3]. For a hypothesis class $\mathcal{H}$ and a hypothesis $h$, the target error is bounded by
$$\varepsilon_T(h) \le \varepsilon_S(h) + \tfrac{1}{2} d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T) + \lambda$$
where $\varepsilon_S(h)$ is the source error, $d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T)$ is the $\mathcal{H} \triangle \mathcal{H}$-divergence (the $\mathcal{H}$-divergence taken over the symmetric-difference hypothesis class) between the input marginals, and $\lambda = \min_{h' \in \mathcal{H}} [\varepsilon_S(h') + \varepsilon_T(h')]$ is the combined error of the best joint hypothesis on both domains. The bound is intuitive: a model performs well on the target if it performs well on the source, the input distributions are not too different from the hypothesis class's perspective, and at least one good hypothesis exists for both domains. As the authors put it, the divergence measure "can be estimated from finite, unlabeled samples from the domains" [1].
The $\mathcal{H}$-divergence is finite-sample estimable from unlabelled data and can be approximated by training a domain classifier: if a model can easily distinguish source from target features, the divergence is large. This estimator is the conceptual seed of every adversarial method below.
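The domain-classifier estimate is often reported as a proxy A-distance, $2(1 - 2\epsilon)$, where $\epsilon$ is the held-out error of a classifier trained to separate source from target. The sketch below assumes a plain logistic-regression domain classifier trained by gradient descent; the function name, the 50/50 split, and the hyperparameters are illustrative choices, not a prescribed procedure.

```python
import numpy as np

def proxy_a_distance(src, tgt, steps=500, lr=0.5, seed=0):
    """Proxy A-distance 2 * (1 - 2 * err) from a logistic domain classifier."""
    rng = np.random.default_rng(seed)
    x = np.vstack([src, tgt])
    y = np.concatenate([np.zeros(len(src)), np.ones(len(tgt))])  # 0 = source, 1 = target

    # Shuffle, then split into halves for training and held-out evaluation.
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    half = len(x) // 2
    x_train, y_train, x_test, y_test = x[:half], y[:half], x[half:], y[half:]

    # Standardise with training statistics only.
    mu, sigma = x_train.mean(axis=0), x_train.std(axis=0) + 1e-8
    x_train, x_test = (x_train - mu) / sigma, (x_test - mu) / sigma

    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
        grad = p - y_train                      # gradient of the logistic loss w.r.t. logits
        w -= lr * (x_train.T @ grad) / len(x_train)
        b -= lr * grad.mean()

    pred = (x_test @ w + b > 0).astype(float)
    err = np.mean(pred != y_test)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(1)
base = rng.normal(0.0, 1.0, size=(1000, 2))
same = rng.normal(0.0, 1.0, size=(1000, 2))     # same distribution as base
shifted = rng.normal(3.0, 1.0, size=(1000, 2))  # clearly different distribution

pad_same = proxy_a_distance(base, same)
pad_shifted = proxy_a_distance(base, shifted)
print(pad_same, pad_shifted)
```

Values near 0 mean the classifier cannot tell the domains apart; values near 2 mean they are almost perfectly separable. Because the held-out error is noisy, the estimate can dip slightly below 0 when the two samples come from the same distribution.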
Several other strands have grown from this foundation:
The field uses a fairly fine-grained vocabulary for the precise data and label availability at adaptation time. The three core regimes are supervised DA (a few labelled target examples exist), semi-supervised DA (a small labelled target set alongside a large unlabelled one), and unsupervised DA (no target labels at all), the last of which is the dominant academic setting.
| Setting | Source data | Target data | Labels | Defining paper |
|---|---|---|---|---|
| Unsupervised DA (UDA) | Labelled, available | Unlabelled, available | Source only | Ganin & Lempitsky 2015 |
| Semi-supervised DA | Labelled, available | Mostly unlabelled, few labelled | Source plus a few target | Saito et al. 2019 |
| Few-shot DA | Labelled, available | Few labelled examples per class | Source plus K target | Motiian et al. 2017 |
| Multi-source DA | Multiple labelled source domains | Unlabelled | Sources only | Peng et al. 2019 |
| Open-set DA | Labelled, with C classes | Unlabelled, with classes possibly outside C | Source only | Panareda Busto & Gall 2017 |
| Universal DA | Labelled, with possible private classes | Unlabelled, with possible private classes | Source only | You et al. 2019 |
| Source-free DA (SFDA) | Pre-trained model only, no source data | Unlabelled, available | Implicit in source model | Liang et al. 2020 (SHOT) |
| Test-time adaptation (TTA) | Pre-trained model only | Streaming target examples at inference | None | Wang et al. 2021 (TENT) |
| Domain generalisation (DG) | Multiple source domains | None during training | Sources only | Muandet, Balduzzi, Schölkopf 2013 |
Unsupervised DA is the dominant academic setting because it captures the essential difficulty (no target labels). Source-free DA reflects realistic deployment, where the source data may be proprietary, regulated, or too large to ship to the deployment site. Domain generalisation drops target access entirely. In practice these settings shade into one another.
The methods landscape is large but organises into a small number of families.
| Family | Idea | Representative method | Year |
|---|---|---|---|
| Importance reweighting | Match source marginal to target marginal by weighting samples | KMM (Huang et al.); KLIEP (Sugiyama et al.) | 2007 / 2008 |
| Discrepancy-based feature alignment | Minimise a statistical distance (MMD, CORAL, CMD) between source and target features | DDC (Tzeng et al. 2014); DAN (Long et al. 2015); JAN (Long et al. 2017); Deep CORAL (Sun & Saenko 2016) | 2014 to 2017 |
| Adversarial alignment | Train a domain classifier and force features to fool it | DANN (Ganin & Lempitsky 2015); ADDA (Tzeng et al. 2017); CDAN (Long et al. 2018); MCD (Saito et al. 2018) | 2015 to 2018 |
| Generative / image-translation | Translate source images to target style with a GAN | CyCADA (Hoffman et al. 2018) | 2018 |
| Optimal transport | Align joint feature-label distributions via Wasserstein distance | OTDA (Courty et al. 2016); JDOT (Courty et al. 2017); DeepJDOT (Damodaran et al. 2018) | 2016 to 2018 |
| Self-training / pseudo-labelling | Generate confident target pseudo-labels and retrain | SHOT (Liang et al. 2020); FixMatch-style hybrids | 2019 onward |
| Test-time adaptation | Adapt batch-norm or affine parameters at inference | TENT (Wang et al. 2021); MEMO (Zhang et al. 2022); T3A (Iwasawa & Matsuo 2021) | 2020 onward |
| Domain generalisation | Train on multiple sources to be robust to unseen target | DICA (Muandet et al. 2013); IRM (Arjovsky et al. 2019); GroupDRO (Sagawa et al. 2020); MixStyle (Zhou et al. 2021) | 2013 onward |
Maximum Mean Discrepancy (MMD) is a kernel-based distance between distributions, measuring the difference between the mean embeddings of two samples in a Reproducing Kernel Hilbert Space; the canonical reference is Gretton et al.'s kernel two-sample test (JMLR 2012), which defines the MMD as "the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space" [9]. Long, Cao, Wang, and Jordan (2015) introduced Deep Adaptation Networks (DAN), adding a multi-kernel MMD penalty across multiple task-specific layers of a CNN [10]. Their follow-up Joint Adaptation Networks (JAN, ICML 2017) align the joint distribution of multiple layers rather than marginals layer-by-layer [11].
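For concreteness, a squared-MMD estimate can be computed directly from kernel matrices. This is a minimal sketch, assuming a single Gaussian (RBF) kernel with a hand-picked bandwidth and the simple biased estimator; DAN itself uses a multi-kernel variant inside a network, which is not reproduced here.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(300, 2))
target_same = rng.normal(0.0, 1.0, size=(300, 2))
target_shifted = rng.normal(2.0, 1.0, size=(300, 2))

mmd_same = mmd2_biased(source, target_same)
mmd_shifted = mmd2_biased(source, target_shifted)
print(mmd_same, mmd_shifted)
```

The estimate is close to zero when both samples come from the same distribution and grows with the shift. The bandwidth strongly affects sensitivity, which is part of the motivation for combining several kernels.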
Sun and Saenko (2016) proposed Deep CORAL in an ECCV 2016 workshop paper [12]. CORAL aligns the second-order statistics of source and target features by minimising the Frobenius distance between covariance matrices; Deep CORAL extends this to a differentiable layer. The original CORAL method was branded "frustratingly easy" by Sun, Feng, and Saenko (AAAI 2016) because the closed-form alignment is a few lines of code, yet it matches more elaborate methods on several benchmarks [13]; Deep CORAL carries the same loss into deep networks.
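Both forms of CORAL fit in a few lines. The sketch below assumes the loss is scaled by $1/(4d^2)$ for $d$-dimensional features and that the closed-form alignment whitens the source features and re-colours them with the target covariance; the `eps` regulariser on the diagonal and the function names are illustrative assumptions.

```python
import numpy as np

def coral_loss(src, tgt):
    """Squared Frobenius distance between feature covariances, scaled by 1 / (4 d^2)."""
    d = src.shape[1]
    cov_s = np.cov(src, rowvar=False)
    cov_t = np.cov(tgt, rowvar=False)
    return np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d)

def _sym_power(matrix, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(matrix)
    return (vecs * vals ** power) @ vecs.T

def coral_align(src, tgt, eps=1.0):
    """Whiten source features, then re-colour them with the target covariance."""
    d = src.shape[1]
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(d)  # eps keeps the matrix invertible
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(d)
    return src @ _sym_power(cov_s, -0.5) @ _sym_power(cov_t, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(2000, 3))   # unit-variance source features
tgt = rng.normal(0.0, 3.0, size=(2000, 3))   # target features with larger variance

loss_before = coral_loss(src, tgt)
loss_after = coral_loss(coral_align(src, tgt), tgt)
print(loss_before, loss_after)
```

With `eps` greater than zero the aligned covariance does not match the target exactly, so the loss shrinks rather than vanishing. In a deep network the same loss would be computed on mini-batch activations and added to the classification loss.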
Adversarial alignment turns the $\mathcal{H}$-divergence into an explicit optimisation. Ganin and Lempitsky (2015) introduced Domain-Adversarial Neural Networks (DANN) at ICML, with a journal version by Ganin et al. (2016) in JMLR (vol. 17, pp. 1-35) [14][15]. The paper argues that "predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains" [15]. DANN attaches a domain classifier to the feature extractor through a gradient reversal layer (GRL) that, in the authors' words, "reverse[s] (or multipl[ies] by -1) the gradient that comes back through it" during backpropagation [15]. The feature extractor is therefore trained to maximise the domain classifier's loss while the domain classifier minimises it, producing features that are discriminative for the source task but indistinguishable across domains. Beyond the GRL and the small domain-classifier head, the method needs no architectural changes.
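The effect of gradient reversal can be illustrated without a deep-learning framework. The sketch below assumes a linear feature extractor `W` and a logistic domain classifier `v`, with gradients written out by hand; every name and size is hypothetical, and a real implementation would express the GRL as a custom autograd operation.

```python
import numpy as np

def grl_forward(features):
    # Forward pass: the gradient reversal layer is the identity.
    return features

def grl_backward(grad_output, lam):
    # Backward pass: flip the sign (and scale by lam) of the incoming gradient.
    return -lam * grad_output

def domain_loss(x, d, W, v):
    """Binary cross-entropy of the domain classifier on features x @ W."""
    feats = grl_forward(x @ W)
    p = 1.0 / (1.0 + np.exp(-(feats @ v)))
    loss = -np.mean(d * np.log(p + 1e-12) + (1.0 - d) * np.log(1.0 - p + 1e-12))
    return loss, p

def reversed_feature_grad(x, d, W, v, lam=1.0):
    """Gradient of the domain loss that reaches W after passing through the GRL."""
    _, p = domain_loss(x, d, W, v)
    d_logits = (p - d) / len(d)            # dL/dlogit for mean cross-entropy
    d_feats = np.outer(d_logits, v)        # gradient arriving at the GRL output
    d_feats = grl_backward(d_feats, lam)   # reversed on the way to the extractor
    return x.T @ d_feats

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(1.0, 1.0, (100, 4))])
d = np.concatenate([np.zeros(100), np.ones(100)])   # 0 = source, 1 = target
W = rng.normal(0.0, 0.5, (4, 3))
v = rng.normal(0.0, 0.5, 3)

loss_before, _ = domain_loss(x, d, W, v)
W_updated = W - 0.01 * reversed_feature_grad(x, d, W, v)  # ordinary descent step
loss_after, _ = domain_loss(x, d, W_updated, v)
print(loss_before, loss_after)
```

An ordinary descent step on the reversed gradient raises the domain loss, so the extractor moves toward features the domain classifier finds harder to separate, while the classifier's own parameters would still be updated with the unreversed gradient.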
Tzeng, Hoffman, Saenko, and Darrell (2017) generalised this idea in Adversarial Discriminative Domain Adaptation (ADDA) at CVPR [16]. ADDA uses a two-stage scheme: a source encoder and classifier are trained first, then a separate target encoder with untied weights is trained against a domain discriminator using a standard GAN-style adversarial loss. The authors present it as simpler than competing domain-adversarial methods and report that it is effective on cross-modality tasks such as adapting from RGB to depth.
Long, Cao, Wang, and Jordan (2018) introduced Conditional Domain Adversarial Networks (CDAN) at NeurIPS, which condition the domain discriminator on the cross-covariance between feature representations and classifier predictions [17]. CDAN aligns the joint distribution rather than just the marginal, which helps when the feature distribution is multimodal, with a separate mode for each class.
Saito, Watanabe, Ushiku, and Harada (2018) proposed Maximum Classifier Discrepancy (MCD) at CVPR [18]. MCD trains two task classifiers on top of a shared feature extractor and alternates between maximising and minimising their disagreement on target samples, using the discrepancy as a proxy for whether target features lie inside the source classifier's well-supported region.
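The disagreement term itself is simple to write down. This sketch assumes the discrepancy is the mean absolute difference between the two classifiers' softmax outputs; the logits are made-up values, and the alternating max/min training loop is not shown.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute difference between two classifiers' class probabilities."""
    return float(np.mean(np.abs(softmax(logits_a) - softmax(logits_b))))

# Hypothetical target-batch logits from two classifier heads on shared features.
agreeing_a = np.array([[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
agreeing_b = np.array([[3.5, 0.2, 0.0], [0.1, 3.2, 0.0]])
conflicting_b = np.array([[0.0, 4.0, 0.0], [0.0, 0.0, 3.0]])

low = classifier_discrepancy(agreeing_a, agreeing_b)
high = classifier_discrepancy(agreeing_a, conflicting_b)
print(low, high)
```

In the alternating scheme described above, the two heads would be updated to increase this quantity on target samples and the shared feature extractor to decrease it.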
Hoffman et al. (2018) introduced CyCADA (Cycle-Consistent Adversarial Domain Adaptation) at ICML, which adapts at both the pixel level and the feature level, building on CycleGAN to translate source images into target style with a semantic consistency loss [19]. The approach was particularly effective on synthetic-to-real urban driving (GTA to Cityscapes).
Courty, Flamary, Tuia, and Rakotomamonjy introduced OTDA in TPAMI 2016, transporting source samples to the target via an entropy-regularised Wasserstein problem [20]. Courty, Flamary, Habrard, and Rakotomamonjy followed with JDOT (Joint Distribution Optimal Transport) at NeurIPS 2017, which aligns joint feature-label distributions and learns the prediction function and transport plan together [21]. DeepJDOT (Damodaran et al. 2018) extended this to deep networks [22].
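The entropy-regularised transport step can be sketched with Sinkhorn iterations. The code below assumes uniform weights on both samples, a squared-Euclidean cost rescaled to [0, 1], and a barycentric mapping to move source points; the regularisation strength and iteration count are arbitrary illustrative values, and the class-based regularisers used in the OTDA paper are omitted.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, iters=500):
    """Entropy-regularised optimal transport plan between two uniform samples."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform source weights
    b = np.full(m, 1.0 / m)          # uniform target weights
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (kernel.T @ u)       # rescale to match target marginal
        u = a / (kernel @ v)         # rescale to match source marginal
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 2))
tgt = rng.normal(3.0, 1.0, size=(100, 2))

# Pairwise squared Euclidean cost, rescaled so exp(-cost / reg) does not underflow.
cost = np.sum((src[:, None, :] - tgt[None, :, :]) ** 2, axis=2)
cost = cost / cost.max()

plan = sinkhorn_plan(cost)

# Barycentric mapping: each source point moves to the plan-weighted mean of targets.
transported = (plan @ tgt) / plan.sum(axis=1, keepdims=True)
print(src.mean(axis=0), transported.mean(axis=0), tgt.mean(axis=0))
```

A classifier trained on the transported source samples with their original labels can then be applied to the target. Smaller `reg` gives a sharper plan but slower, less stable iterations.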
Self-training adapts a model by using its confident predictions on target data as pseudo-labels for further training. Liang, Hu, and Feng (2020) introduced SHOT (Source HypOthesis Transfer) at ICML, which freezes the source classifier and learns a new feature extractor for the target by combining information maximisation with self-supervised pseudo-labelling [23]. SHOT does not need source data at adaptation time, which matters when source data is private or proprietary.
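The information-maximisation objective rewards predictions that are individually confident and collectively spread across classes. The sketch below assumes the loss is the mean per-sample entropy plus the negative entropy of the batch-averaged prediction; the probability tables are invented for illustration, and SHOT's pseudo-labelling step is not shown.

```python
import numpy as np

def information_maximisation_loss(probs, eps=1e-12):
    """Mean per-sample entropy minus the entropy of the average prediction."""
    per_sample_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    mean_pred = probs.mean(axis=0)
    neg_marginal_entropy = np.sum(mean_pred * np.log(mean_pred + eps))
    return float(per_sample_entropy + neg_marginal_entropy)

# Confident predictions spread evenly over three classes (the desired outcome).
confident_diverse = np.array([[0.98, 0.01, 0.01],
                              [0.01, 0.98, 0.01],
                              [0.01, 0.01, 0.98]])
# Confident predictions that all collapse onto one class.
collapsed = np.array([[0.98, 0.01, 0.01]] * 3)
# Maximally uncertain predictions.
uniform = np.full((3, 3), 1.0 / 3.0)

loss_diverse = information_maximisation_loss(confident_diverse)
loss_collapsed = information_maximisation_loss(collapsed)
loss_uniform = information_maximisation_loss(uniform)
print(loss_diverse, loss_collapsed, loss_uniform)
```

Entropy minimisation alone would be equally satisfied by the collapsed case; the diversity term is what separates it from the confident and balanced case.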
Wang, Shelhamer, Liu, Olshausen, and Darrell (2021) introduced TENT (Test-time Entropy Minimisation) at ICLR 2021 [24]. TENT updates only the channel-wise affine parameters of batch-normalisation layers at test time, by minimising the entropy of the model's predictions on the test batch. The method uses only the test data and the trained model, with no access to source data, and the authors report that it "reaches a new state-of-the-art error on ImageNet-C", the corruption benchmark, while also improving CIFAR-10/100-C, segmentation, and digit-recognition adaptation [24]. Subsequent work in this line includes MEMO (Zhang, Levine & Finn 2022) and T3A (Iwasawa & Matsuo 2021), the latter of which adjusts only the classifier's prototypes at test time [25][26].
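The entropy-minimisation update can be sketched on a toy model. The code assumes a single normalisation layer whose statistics come from the test batch, followed by a frozen linear classifier; only the scale `gamma` and shift `beta` are updated, with the gradient of the softmax entropy written out by hand. The model, sizes, and learning rate are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_entropy(probs, eps=1e-12):
    return float(-np.sum(probs * np.log(probs + eps), axis=1).mean())

def entropy_step(x_hat, gamma, beta, w_cls, lr=0.05):
    """One gradient step on the affine parameters only; w_cls stays frozen."""
    probs = softmax((gamma * x_hat + beta) @ w_cls)
    log_p = np.log(probs + 1e-12)
    ent = -np.sum(probs * log_p, axis=1, keepdims=True)
    d_logits = -probs * (log_p + ent) / len(x_hat)  # d(mean entropy) / d(logits)
    d_hidden = d_logits @ w_cls.T
    gamma = gamma - lr * np.sum(d_hidden * x_hat, axis=0)
    beta = beta - lr * np.sum(d_hidden, axis=0)
    return gamma, beta

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(64, 5))          # hypothetical shifted test batch
x_hat = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)  # test-batch statistics
gamma, beta = np.ones(5), np.zeros(5)           # affine parameters to adapt
w_cls = rng.normal(0.0, 1.0, size=(5, 3))       # frozen classifier weights

entropy_before = mean_entropy(softmax((gamma * x_hat + beta) @ w_cls))
for _ in range(20):
    gamma, beta = entropy_step(x_hat, gamma, beta, w_cls)
entropy_after = mean_entropy(softmax((gamma * x_hat + beta) @ w_cls))
print(entropy_before, entropy_after)
```

No labels appear anywhere in the update. Restricting the update to a handful of affine parameters limits how far the model can drift, although lower entropy does not by itself guarantee higher accuracy.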
Domain generalisation removes target access entirely. Muandet, Balduzzi, and Schölkopf (2013) introduced Domain-Invariant Component Analysis (DICA), a kernel method that learns a transformation minimising cross-domain dissimilarity [27]. Modern DG methods include Invariant Risk Minimisation (IRM, Arjovsky et al. 2019), Group Distributionally Robust Optimisation (GroupDRO, Sagawa et al. 2020), and style-mixing methods such as MixStyle (Zhou et al. 2021) [28][29][30].
A notable empirical finding came from Gulrajani and Lopez-Paz (2021) at ICLR. Their DomainBed benchmark unified seven multi-domain datasets, nine baseline algorithms, and three model selection criteria [31]. They reported that, when carefully implemented and tuned with the same hyperparameter budget, plain Empirical Risk Minimisation (ERM) was competitive with all proposed DG methods: in their words, "no algorithm included in DomainBed outperforms ERM by more than one point when evaluated under the same experimental conditions" [31]. This has reshaped how the community evaluates DG.
Progress in the field is measured on a small set of canonical datasets.
| Benchmark | Year | Domains | Classes | Notes |
|---|---|---|---|---|
| Office-31 | 2010 | 3 (Amazon, Webcam, DSLR) | 31 | Saenko, Kulis, Fritz & Darrell at ECCV; 4,110 office-object images (Amazon 2,817, DSLR 498, Webcam 795); standard small-scale UDA |
| Office-Caltech-10 | 2012 | 4 (Amazon, Webcam, DSLR, Caltech) | 10 | Gong et al. extension of Office-31 |
| Office-Home | 2017 | 4 (Art, Clipart, Product, Real-World) | 65 | Venkateswara et al.; 15,588 images; medium-scale |
| VisDA-2017 | 2017 | 2 (synthetic, real) | 12 | Peng et al. challenge dataset; large synthetic-to-real |
| DomainNet | 2019 | 6 (Clipart, Infograph, Painting, Quickdraw, Real, Sketch) | 345 | Peng et al. at ICCV; about 0.6 million images; largest visual DA benchmark |
| PACS | 2017 | 4 (Photo, Art painting, Cartoon, Sketch) | 7 | Li, Yang, Song & Hospedales at ICCV; 9,991 images; main DG benchmark |
| DomainBed | 2021 | varies (PACS, OfficeHome, VLCS, TerraIncognita, etc.) | varies | Gulrajani & Lopez-Paz unified DG testbed (7 datasets, 9 algorithms) |
| WILDS | 2021 | 10 datasets in real domains | varies | Koh et al. at ICML; tumour images, wildlife camera traps, satellite imagery, code, text |
Office-31 contains 4,110 images across 31 categories drawn from three domains: 2,817 images from Amazon listings, 795 low-resolution webcam shots, and 498 high-resolution DSLR photographs [32]. DomainNet is the largest visual DA benchmark, with roughly 0.6 million images across 345 categories and six domains [33].
WILDS aims explicitly at real-world shifts. It bundles 10 datasets, each with a natural domain split: different hospitals (Camelyon17 tumour identification), different camera traps (iWildCam), different times and locations (FMoW satellite imagery), or different countries (PovertyMap) [34]. The authors report that, on each dataset, "standard training yields substantially lower out-of-distribution than in-distribution performance", a gap that "remains even with models trained by existing methods for tackling distribution shifts" [34].
Domain adaptation shows up wherever deployment differs from training.
Domain adaptation also plays a role in fairness work: subpopulation shift formalises the accuracy gap between majority and minority groups, and methods such as GroupDRO can be cast as DA between subpopulations.
Beyond the algorithm zoo, there are several recurring practical issues that often matter more than the choice of method.
Several problems are active research areas as of 2026.
The classical UDA setting with Office-31 and CNNs trained from scratch is now mostly a teaching example. The current frontier is about source-free and test-time methods, benchmarks like WILDS that resemble real deployment, and how foundation models change the meaningful baselines.