Domain adaptation
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,987 words
Domain adaptation is the subfield of transfer learning that addresses distribution shift between a source domain, where labelled training data is plentiful, and a target domain, where labels are scarce or absent. The goal is to train a model that performs well on the target distribution despite the shift. Standard supervised machine learning assumes that training and test data are drawn independently from the same distribution (the i.i.d. assumption); domain adaptation drops the requirement that the two distributions are identical and tries to characterise, bound, and reduce the resulting loss in target performance.
The field has been shaped by a foundational generalisation bound from Ben-David et al. (2010), a wave of deep adversarial and discrepancy-based methods between 2015 and 2018, and a more recent shift toward source-free, test-time, and foundation-model-era robustness.
Real-world deployment rarely matches the training distribution. A vision model trained on stock photography may be evaluated on smartphone images. A medical imaging model trained at one hospital may be deployed at another with a different scanner brand and patient demographic. A driving model trained in California will not see the same lighting or signage in Berlin. Failure to adapt produces silent accuracy drops that are hard to detect post-deployment.
This is not a theoretical concern. Recht, Roelofs, Schmidt, and Shankar (2019) carefully reproduced the ImageNet test-set construction process and found that a wide range of state-of-the-art classifiers lost 11 to 14 percentage points of accuracy on the new test set, even though the new images were drawn from the same Flickr pool by the same protocol. Even the minutiae of dataset collection can produce a measurable distribution shift that hurts every model tested.
Foundation-model fine-tuning is implicitly a domain adaptation problem: a pre-trained model encodes a broad source distribution and is adapted to a narrower target task with much less data, often using parameter-efficient methods such as LoRA.
The domain adaptation literature distinguishes several flavours of shift between $P_S(X, Y)$ and $P_T(X, Y)$, each with different identifiability conditions and different algorithmic remedies.
| Shift type | Decomposition | What changes | Canonical example |
|---|---|---|---|
| Covariate shift | $P(Y \mid X)$ preserved, $P(X)$ differs | Inputs are sampled differently | Different camera or sensor; different population sampling |
| Label / prior shift | $P(X \mid Y)$ preserved, $P(Y)$ differs | Class frequencies change | Disease prevalence differs between hospitals |
| Concept shift / drift | $P(Y \mid X)$ differs | Labelling rule changes | User preferences shift over time; sentiment classes redefined |
| Subpopulation shift | Target is a sub-distribution of source | Mass concentrates on a slice | A model trained on all users is deployed only to a minority group |
| Domain shift | General; both marginals and conditionals may move | Different data-generating processes | Photograph vs sketch; English vs Spanish text |
Covariate shift is the most theoretically tractable case because the target risk can be estimated without bias from labelled source data by importance weighting, provided the density ratio $P_T(x) / P_S(x)$ is known (or can be estimated) and the source distribution covers the support of the target. Concept shift is the hardest: without target labels there is no information about how $P(Y \mid X)$ has moved. Deep-learning DA usually addresses domain shift, where multiple components change simultaneously.
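The importance-weighting argument can be illustrated with a small sketch. The two-valued input, the probability tables `p_source` and `p_target`, and the per-sample losses below are made-up illustrative values, and the density ratio is assumed to be known exactly, which is rarely true in practice.

```python
def importance_weighted_risk(losses, xs, p_source, p_target):
    """Estimate the target risk from source samples under covariate shift.

    Each source loss is reweighted by the density ratio P_T(x) / P_S(x).
    """
    total = 0.0
    for loss, x in zip(losses, xs):
        weight = p_target[x] / p_source[x]
        total += weight * loss
    return total / len(losses)


# Hypothetical discrete input with two values; the source over-samples x = 0.
p_source = {0: 0.8, 1: 0.2}
p_target = {0: 0.2, 1: 0.8}

# A source sample that matches P_S exactly: eight draws of x = 0, two of x = 1.
xs = [0] * 8 + [1] * 2
# Assumed per-sample losses: the model is worse on x = 1.
losses = [0.1] * 8 + [0.5] * 2

source_risk = sum(losses) / len(losses)
target_risk_estimate = importance_weighted_risk(losses, xs, p_source, p_target)
```

In this toy case the unweighted source risk is 0.18, while the reweighted estimate equals the true target risk $0.2 \cdot 0.1 + 0.8 \cdot 0.5 = 0.42$. Large weights inflate the variance of the estimate, which is one reason methods such as KMM and KLIEP estimate the ratio directly.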
The theoretical core is the generalisation bound of Ben-David et al., introduced in Analysis of Representations for Domain Adaptation (NeurIPS 2007) and extended in A theory of learning from different domains (Machine Learning, 2010). For a hypothesis class $\mathcal{H}$ and a hypothesis $h$, the target error is bounded by
$$\varepsilon_T(h) \le \varepsilon_S(h) + \tfrac{1}{2}\, d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T) + \lambda$$
where $\varepsilon_S(h)$ is the source error, $d_{\mathcal{H} \triangle \mathcal{H}}$ is the symmetric-difference $\mathcal{H}$-divergence between input marginals, and $\lambda = \min_{h' \in \mathcal{H}} \left[ \varepsilon_S(h') + \varepsilon_T(h') \right]$ is the combined error of the best joint hypothesis on both domains. The bound is intuitive: a model performs well on the target if it performs well on the source, the input distributions are not too different from the hypothesis class's perspective, and at least one good hypothesis exists for both domains.
The $\mathcal{H}$-divergence is finite-sample estimable from unlabelled data and can be approximated by training a domain classifier: if a model can easily distinguish source from target features, the divergence is large. This estimator is the conceptual seed of every adversarial method below.
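The domain-classifier approximation is often reported as a proxy $\mathcal{A}$-distance, $2(1 - 2\epsilon)$, where $\epsilon$ is the domain classifier's error. The sketch below assumes one-dimensional features and uses a threshold classifier as a stand-in for the domain classifier; both choices, and the function names, are illustrative. The error is measured on the training points for brevity, whereas a real estimate would use held-out data.

```python
def stump_error(source, target):
    """Lowest error of a 1-D threshold rule separating source (0) from target (1)."""
    points = [(x, 0) for x in source] + [(x, 1) for x in target]
    n = len(points)
    best = 0.5  # chance level for a balanced sample
    for threshold in sorted(x for x, _ in points):
        # Predict "target" when the feature exceeds the threshold.
        errors = sum(1 for x, domain in points if (x > threshold) != (domain == 1))
        error = errors / n
        # Also allow the flipped rule (predict "target" below the threshold).
        best = min(best, error, 1.0 - error)
    return best


def proxy_a_distance(source, target):
    """Proxy A-distance 2 * (1 - 2 * err): 0 if inseparable, 2 if perfectly separable."""
    return 2.0 * (1.0 - 2.0 * stump_error(source, target))


separable = proxy_a_distance([0.0, 0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3])
identical = proxy_a_distance([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
```

Here `separable` evaluates to 2.0 and `identical` to 0.0, the two extremes of the scale.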
Several other strands have grown from this foundation:
The field uses a fairly fine-grained vocabulary for the precise data and label availability at adaptation time.
| Setting | Source data | Target data | Labels | Defining paper |
|---|---|---|---|---|
| Unsupervised DA (UDA) | Labelled, available | Unlabelled, available | Source only | Ganin & Lempitsky 2015 |
| Semi-supervised DA | Labelled, available | Mostly unlabelled, few labelled | Source plus a few target | Saito et al. 2019 |
| Few-shot DA | Labelled, available | Few labelled examples per class | Source plus K target | Motiian et al. 2017 |
| Multi-source DA | Multiple labelled source domains | Unlabelled | Sources only | Peng et al. 2019 |
| Open-set DA | Labelled, with C classes | Unlabelled, with classes possibly outside C | Source only | Panareda Busto & Gall 2017 |
| Universal DA | Labelled, with possible private classes | Unlabelled, with possible private classes | Source only | You et al. 2019 |
| Source-free DA (SFDA) | Pre-trained model only, no source data | Unlabelled, available | Implicit in source model | Liang et al. 2020 (SHOT) |
| Test-time adaptation (TTA) | Pre-trained model only | Streaming target examples at inference | None | Wang et al. 2021 (TENT) |
| Domain generalisation (DG) | Multiple source domains | None during training | Sources only | Muandet, Balduzzi, Schölkopf 2013 |
Unsupervised DA is the dominant academic setting because it captures the essential difficulty (no target labels). Source-free DA reflects realistic deployment, where the source data may be proprietary, regulated, or too large to ship to the deployment site. Domain generalisation drops target access entirely. In practice these settings shade into one another.
The methods landscape is large but organises into a small number of families.
| Family | Idea | Representative method | Year |
|---|---|---|---|
| Importance reweighting | Match source marginal to target marginal by weighting samples | KMM (Huang et al.); KLIEP (Sugiyama et al.) | 2007 / 2008 |
| Discrepancy-based feature alignment | Minimise a statistical distance (MMD, CORAL, CMD) between source and target features | DDC (Tzeng et al. 2014); DAN (Long et al. 2015); JAN (Long et al. 2017); Deep CORAL (Sun & Saenko 2016) | 2014 to 2017 |
| Adversarial alignment | Train a domain classifier and force features to fool it | DANN (Ganin & Lempitsky 2015); ADDA (Tzeng et al. 2017); CDAN (Long et al. 2018); MCD (Saito et al. 2018) | 2015 to 2018 |
| Generative / image-translation | Translate source images to target style with a GAN | CyCADA (Hoffman et al. 2018) | 2018 |
| Optimal transport | Align joint feature-label distributions via Wasserstein distance | OTDA (Courty et al. 2016); JDOT (Courty et al. 2017); DeepJDOT (Damodaran et al. 2018) | 2016 to 2018 |
| Self-training / pseudo-labelling | Generate confident target pseudo-labels and retrain | SHOT (Liang et al. 2020); FixMatch-style hybrids | 2019 onward |
| Test-time adaptation | Adapt batch-norm or affine parameters at inference | TENT (Wang et al. 2021); MEMO (Zhang et al. 2022); T3A (Iwasawa & Matsuo 2021) | 2020 onward |
| Domain generalisation | Train on multiple sources to be robust to unseen target | DICA (Muandet et al. 2013); IRM (Arjovsky et al. 2019); GroupDRO (Sagawa et al. 2020); MixStyle (Zhou et al. 2021) | 2013 onward |
Maximum Mean Discrepancy (MMD) is a kernel-based distance between distributions, measuring the difference between mean embeddings of two samples in a Reproducing Kernel Hilbert Space (Gretton et al. 2007). Long, Cao, Wang, and Jordan (2015) introduced Deep Adaptation Networks (DAN), adding a multi-kernel MMD penalty across multiple task-specific layers of a CNN. Their follow-up Joint Adaptation Networks (JAN, ICML 2017) align the joint distribution of multiple layers rather than marginals layer-by-layer.
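As a rough illustration of the quantity being penalised, the following sketch computes the biased empirical estimate of squared MMD with a single Gaussian (RBF) kernel. The function names and the fixed `bandwidth` are assumptions for this example; DAN itself uses a mixture of kernels applied to deep features rather than one kernel on raw inputs.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""

    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

same_domain = mmd_squared(source, source)   # zero for identical samples
cross_domain = mmd_squared(source, shifted)  # clearly positive after the shift
```

The estimate is sensitive to the kernel bandwidth, which is part of the motivation for the multi-kernel variant.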
Sun and Saenko (2016) proposed Deep CORAL in an ECCV 2016 workshop paper. CORAL aligns the second-order statistics of source and target features by minimising the Frobenius distance between covariance matrices; Deep CORAL turns this into a differentiable loss that can be trained end-to-end inside a deep network. The authors call CORAL a "frustratingly easy" method because it takes only a few lines of code, yet it matches more elaborate methods on several benchmarks.
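A minimal sketch of the covariance-matching loss follows, using the $1/(4d^2)$ scaling of the squared Frobenius distance for $d$-dimensional features. The helper names and toy features are illustrative; a practical version would operate on mini-batches of network activations in an autodiff framework.

```python
def covariance(features):
    """Unbiased sample covariance matrix of a list of d-dimensional rows."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in features) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cov_s, cov_t = covariance(source), covariance(target)
    squared_frobenius = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return squared_frobenius / (4.0 * d * d)


source = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
translated = [[x + 5.0, y - 1.0] for x, y in source]  # same covariance
scaled = [[2.0 * x, 2.0 * y] for x, y in source]      # covariance times 4

loss_translated = coral_loss(source, translated)
loss_scaled = coral_loss(source, scaled)
```

A pure translation leaves the loss at zero because covariances ignore the mean, which shows that this loss aligns second-order statistics only.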
Adversarial alignment turns the $\mathcal{H}$-divergence into an explicit optimisation. Ganin and Lempitsky (2015) introduced Domain-Adversarial Neural Networks (DANN) at ICML, with a journal version by Ganin et al. (2016) in JMLR. DANN attaches a domain classifier to the feature extractor through a gradient reversal layer (GRL), which acts as the identity in the forward pass and multiplies the gradient by a negative constant during backpropagation. The feature extractor is thereby trained to maximise the domain classifier's loss, producing features that are discriminative for the source task but indistinguishable across domains. The method needs no architectural changes other than the GRL and the domain classifier head.
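The behaviour of a gradient reversal layer can be shown without any deep-learning framework. The class below is a hand-written sketch with explicit `forward` and `backward` methods; the class name and the `scale` parameter are assumptions for illustration. In an autodiff library the same idea would be written as a custom-gradient operation.

```python
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # The reversal strength; often increased gradually during training.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_gradient):
        # The gradient flowing back to the feature extractor is sign-flipped,
        # so a step that lowers the domain loss for the classifier raises it
        # for the feature extractor.
        return [-self.scale * g for g in upstream_gradient]


layer = GradientReversal(scale=2.0)
outputs = layer.forward([1.0, -2.0])
reversed_gradient = layer.backward([0.5, -1.0])
```

With `scale=2.0`, an upstream gradient of `[0.5, -1.0]` becomes `[-1.0, 2.0]`, while the forward output is unchanged.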
Tzeng, Hoffman, Saenko, and Darrell (2017) generalised this idea in Adversarial Discriminative Domain Adaptation (ADDA) at CVPR. ADDA uses a two-stage scheme with untied weights between source and target encoders and a standard GAN-style adversarial loss. The authors present it as a simpler formulation and report that it is effective on cross-modality tasks such as adapting from RGB to depth.
Long, Cao, Wang, and Jordan (2018) introduced Conditional Domain Adversarial Networks (CDAN) at NeurIPS, which condition the domain discriminator on the cross-covariance between feature representations and classifier predictions. CDAN aligns the joint distribution rather than just the marginal, which helps when the data are multimodal in the class sense.
Saito, Watanabe, Ushiku, and Harada (2018) proposed Maximum Classifier Discrepancy (MCD) at CVPR. MCD trains two task classifiers on top of a shared feature extractor and alternates between maximising and minimising their disagreement on target samples, using the discrepancy as a proxy for whether target features lie inside the source classifier's well-supported region.
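The disagreement measure can be sketched as the mean absolute difference between the two classifiers' class-probability outputs, averaged over target samples. The function name and the toy probabilities are illustrative, and the alternating min-max training loop around this quantity is omitted.

```python
def classifier_discrepancy(probs_a, probs_b):
    """Mean over samples of the average absolute difference between class probabilities."""
    total = 0.0
    for pa, pb in zip(probs_a, probs_b):
        total += sum(abs(a - b) for a, b in zip(pa, pb)) / len(pa)
    return total / len(probs_a)


# Hypothetical softmax outputs of two classifier heads on two target samples.
head_one = [[0.9, 0.1], [0.5, 0.5]]
head_two = [[0.9, 0.1], [0.1, 0.9]]

agreement = classifier_discrepancy(head_one, head_one)
disagreement = classifier_discrepancy(head_one, head_two)
```

Target samples on which the heads disagree contribute most to the discrepancy; these are the samples the feature extractor is then trained to move.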
Hoffman et al. (2018) introduced CyCADA (Cycle-Consistent Adversarial Domain Adaptation) at ICML, which adapts at both the pixel level and the feature level, building on CycleGAN to translate source images into target style with a semantic consistency loss. The approach was particularly effective on synthetic-to-real urban driving (GTA to Cityscapes).
Courty, Flamary, Tuia, and Rakotomamonjy introduced OTDA in TPAMI 2016, transporting source samples to the target via an entropy-regularised Wasserstein problem. Courty, Flamary, Habrard, and Rakotomamonjy followed with JDOT (Joint Distribution Optimal Transport) at NeurIPS 2017, which aligns joint feature-label distributions and learns the prediction function and transport plan together. DeepJDOT (Damodaran et al. 2018) extended this to deep networks.
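Entropy-regularised transport problems of this kind are commonly solved with Sinkhorn iterations. The sketch below is a plain-Python version for small problems; the function name, the regularisation strength `reg`, and the fixed iteration count are assumptions for illustration, and it omits the class-based regularisers that OTDA adds.

```python
import math


def sinkhorn_plan(cost, a, b, reg=0.1, iterations=200):
    """Approximate the entropy-regularised transport plan between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap moves get large entries.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two source and two target points with uniform mass; matching indices are cheap.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(cost, a=[0.5, 0.5], b=[0.5, 0.5])
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

The resulting plan places almost all mass on the diagonal, and a barycentric projection of each source sample under the plan then gives its transported position in the target domain. Very small `reg` values can underflow in this naive form; log-domain implementations avoid that.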
Self-training adapts a model by using its confident predictions on target data as pseudo-labels for further training. Liang, Hu, and Feng (2020) introduced SHOT (Source HypOthesis Transfer) at ICML, which freezes the source classifier and learns a new feature extractor for the target by combining information maximisation with self-supervised pseudo-labelling. SHOT does not need source data at adaptation time, which matters when source data is private or proprietary.
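The generic self-training step, selecting confident target predictions as pseudo-labels, can be sketched as follows. This is a plain confidence-threshold rule, not SHOT's centroid-based pseudo-labelling, and the function name and `threshold` value are assumptions for illustration.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) pairs whose confidence meets the threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected


# Hypothetical class probabilities predicted on three unlabelled target samples.
target_probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
pseudo_labels = select_pseudo_labels(target_probs)
```

Only the first and third samples pass the threshold here. Because wrong but confident predictions are reinforced in later rounds, the threshold trades off label coverage against label noise.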
Wang, Shelhamer, Liu, Olshausen, and Darrell (2021) introduced TENT (test entropy minimisation) at ICLR 2021. TENT updates only the channel-wise affine parameters of batch-norm layers at test time, by minimising the entropy of the model's predictions on the test batch. The method requires only the trained model and the unlabelled test data; it uses no source data and no labels. TENT reduced error substantially on ImageNet-C, the corruption benchmark, and worked on segmentation and digit-recognition adaptation. Subsequent work in this line includes MEMO (Zhang, Levine & Finn 2022) and T3A (Iwasawa & Matsuo 2021), which adjusts only the classifier's prototypes at test time.
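The objective minimised at test time is the mean Shannon entropy of the predicted class distributions on a batch. The sketch below only evaluates that objective from logits; the function names and example logits are illustrative, and the gradient step on the batch-norm affine parameters, which needs an autodiff framework, is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_prediction_entropy(batch_logits):
    """Average Shannon entropy (in nats) of the softmax predictions in a batch."""
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)


uncertain = mean_prediction_entropy([[0.0, 0.0]])    # uniform prediction
confident = mean_prediction_entropy([[10.0, -10.0]])  # near one-hot prediction
```

A uniform two-class prediction has entropy $\ln 2$, while a near one-hot prediction has entropy close to zero, so minimising this quantity pushes the model toward confident predictions on the test batch.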
Domain generalisation removes target access entirely. Muandet, Balduzzi, and Schölkopf (2013) introduced Domain-Invariant Component Analysis (DICA), a kernel method that learns a transformation minimising cross-domain dissimilarity. Modern DG methods include Invariant Risk Minimisation (IRM, Arjovsky et al. 2019), Group Distributionally Robust Optimisation (GroupDRO, Sagawa et al. 2020), and style-mixing methods such as MixStyle (Zhou et al. 2021).
A notable empirical finding came from Gulrajani and Lopez-Paz (2021) at ICLR. Their DomainBed benchmark unified 7 datasets, 14 algorithms, and 3 model selection criteria. They reported that, when tuned with the same hyperparameter budget, plain Empirical Risk Minimisation (ERM) was competitive with all proposed DG methods, and no method beat ERM by more than one point on average. This has reshaped how the community evaluates DG.
Progress in the field is measured on a small set of canonical datasets.
| Benchmark | Year | Domains | Classes | Notes |
|---|---|---|---|---|
| Office-31 | 2010 | 3 (Amazon, Webcam, DSLR) | 31 | Saenko, Kulis, Fritz & Darrell at ECCV; 4,652 office object images; standard small-scale UDA |
| Office-Caltech-10 | 2012 | 4 (Amazon, Webcam, DSLR, Caltech) | 10 | Gong et al. extension of Office-31 |
| Office-Home | 2017 | 4 (Art, Clipart, Product, Real-World) | 65 | Venkateswara et al.; ~15,500 images; medium-scale |
| VisDA-2017 | 2017 | 2 (synthetic, real) | 12 | Peng et al. challenge dataset; large synthetic-to-real |
| DomainNet | 2019 | 6 (Clipart, Infograph, Painting, Quickdraw, Real, Sketch) | 345 | Peng et al. at ICCV; ~600,000 images; largest visual DA benchmark |
| PACS | 2017 | 4 (Photo, Art painting, Cartoon, Sketch) | 7 | Li, Yang, Song & Hospedales at ICCV; 9,991 images; main DG benchmark |
| DomainBed | 2021 | varies (PACS, OfficeHome, VLCS, TerraIncognita, etc.) | varies | Gulrajani & Lopez-Paz unified DG testbed |
| WILDS | 2021 | 10 datasets in real domains | varies | Koh et al. at ICML; tumour images, wildlife camera traps, satellite imagery, code, text |
WILDS aims explicitly at real-world shifts. Each of its constituent datasets contains a natural domain split: different hospitals (Camelyon17 tumour identification), different camera traps (iWildCam), different times and locations (FMoW satellite imagery), or different countries (PovertyMap). The authors report a consistent gap between in-distribution and out-of-distribution accuracy on every dataset, and the gap persists even when state-of-the-art DA methods are applied.
Domain adaptation shows up wherever deployment differs from training.
Domain adaptation also plays a role in fairness work: subpopulation shift formalises the accuracy gap between majority and minority groups, and methods such as GroupDRO can be cast as DA between subpopulations.
Beyond the algorithm zoo, there are several recurring practical issues that often matter more than the choice of method.
Several problems are active research areas as of 2026.
The classical UDA setting with Office-31 and CNNs trained from scratch is now mostly a teaching example. The current frontier is about source-free and test-time methods, benchmarks like WILDS that resemble real deployment, and how foundation models change the meaningful baselines.