# Domain adaptation

> Source: https://aiwiki.ai/wiki/domain_adaptation
> Updated: 2026-06-24
> Categories: Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Domain adaptation** is the subfield of [transfer learning](/wiki/transfer_learning) that adapts a model trained on a labelled *source domain* so it performs well on a related but different *target domain*, where labels are scarce or absent, despite a shift in the input distribution. It is the case of transfer learning in which the task stays the same (the same label space and the same prediction goal) but the source and target inputs are drawn from different distributions, a setting known as covariate or domain shift. Standard supervised [machine learning](/wiki/machine_learning) assumes that training and test data are drawn independently and identically (i.i.d.) from the same distribution; domain adaptation relaxes this i.i.d. assumption and tries to characterise, bound, and reduce the loss in target performance that follows.

The field has been shaped by a foundational generalisation bound from Ben-David et al. (2010), which proved that the target error of a source-trained classifier is governed by the source error plus a distribution-divergence term that is estimable from unlabelled data alone [1]. A wave of deep adversarial and discrepancy-based methods followed between 2015 and 2018, and more recently the focus has shifted toward source-free, test-time, and foundation-model-era robustness.

## Why domain adaptation matters

Real-world deployment rarely matches the training distribution. A vision model trained on stock photography may be evaluated on smartphone images. A medical imaging model trained at one hospital may be deployed at another with a different scanner brand and patient demographic. A driving model trained in California will not see the same lighting or signage in Berlin. Failure to adapt produces silent accuracy drops that are hard to detect post-deployment.

This is not a theoretical concern. Recht, Roelofs, Schmidt, and Shankar (2019) carefully reproduced the [ImageNet](/wiki/imagenet) test-set construction process and found that a wide range of state-of-the-art classifiers lost 11 to 14 percent accuracy on the new test set, even though the new images were drawn from the same Flickr pool by the same protocol; the companion CIFAR-10 reconstruction produced drops of 3 to 15 percent [2]. The minutiae of dataset collection can produce a measurable distribution shift that hurts every model.

Foundation-model fine-tuning is implicitly a domain adaptation problem: a [pre-trained](/wiki/pre-training) model encodes a broad source distribution and is adapted to a narrower target task with much less data, often using parameter-efficient methods such as [LoRA](/wiki/lora).

## What kinds of distribution shift exist?

Domain adaptation literature distinguishes several flavours of shift between $P_S(X, Y)$ and $P_T(X, Y)$, each with different identifiability and different algorithmic remedies.

| Shift type | Decomposition | What changes | Canonical example |
|---|---|---|---|
| Covariate shift | $P(Y\|X)$ preserved, $P(X)$ differs | Inputs are sampled differently | Different camera or sensor; different population sampling |
| Label / prior shift | $P(X\|Y)$ preserved, $P(Y)$ differs | Class frequencies change | Disease prevalence differs between hospitals |
| Concept shift / drift | $P(Y\|X)$ differs | Labelling rule changes | User preferences shift over time; sentiment classes redefined |
| Subpopulation shift | Target is a sub-distribution of source | Mass concentrates on a slice | Deploy a model trained on all users only on a minority group |
| Domain shift | General; both marginals and conditionals may move | Different data-generating processes | Photograph vs sketch; English vs Spanish text |

Covariate shift is the most theoretically tractable case: because $P(Y \mid X)$ is shared, the target risk can be estimated without bias by weighting each source loss with the density ratio $w(x) = P_T(x) / P_S(x)$, provided that ratio is known and the source has support wherever the target does. [Concept drift](/wiki/concept_drift) (concept shift) is the hardest: without target labels there is no information about how $P(Y \mid X)$ has moved. Deep-learning DA usually addresses domain shift, where multiple components change simultaneously.
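The importance-weighting argument for covariate shift can be checked numerically. The sketch below is a toy simulation, not taken from any cited paper: the Gaussian source and target marginals, the labelling rule, and the fixed classifier are all assumptions chosen so that the density ratio is known in closed form.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def zero_one_loss(x):
    """Loss of a fixed classifier h(x) = [x > 0.5] under the shared rule y = [x > 0]."""
    return float((x > 0.5) != (x > 0.0))


def estimate_risks(n=100_000, seed=0):
    """Return (unweighted source risk, importance-weighted risk, direct target risk)."""
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed P_S(x) = N(0, 1)
    target = [rng.gauss(1.0, 1.0) for _ in range(n)]  # assumed P_T(x) = N(1, 1)

    source_risk = sum(zero_one_loss(x) for x in source) / n
    # Reweight each source loss by w(x) = P_T(x) / P_S(x), assumed known here.
    weighted_risk = sum(
        normal_pdf(x, 1.0) / normal_pdf(x, 0.0) * zero_one_loss(x) for x in source
    ) / n
    # Direct target estimate, available only because this is a simulation.
    target_risk = sum(zero_one_loss(x) for x in target) / n
    return source_risk, weighted_risk, target_risk
```

In this toy setting the analytic source risk is about 0.19 and the analytic target risk about 0.15; the weighted estimate uses only source samples yet should land near the target value. In practice the ratio is unknown and has to be estimated, which is where the variance problems of importance weighting come from.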

## Theoretical foundations

The theoretical core is the generalisation bound of Ben-David et al., introduced in *Analysis of Representations for Domain Adaptation* (NeurIPS 2007) and extended in *A theory of learning from different domains* (Machine Learning, 2010, vol. 79, pp. 151-175) [1][3]. For a hypothesis class $\mathcal{H}$ and a hypothesis $h$, the target error is bounded by

$$\varepsilon_T(h) \le \varepsilon_S(h) + \tfrac{1}{2} d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T) + \lambda$$

where $\varepsilon_S(h)$ is the source error, $d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T)$ is the symmetric-difference $\mathcal{H}$-divergence between the input marginals (it enters the bound with a factor of one half), and $\lambda = \min_{h' \in \mathcal{H}} [\varepsilon_S(h') + \varepsilon_T(h')]$ is the combined error of the best single hypothesis on both domains. The bound is intuitive: a model performs well on the target if it performs well on the source, the input distributions are not too different from the hypothesis class's perspective, and at least one hypothesis has low error on both domains. As the authors put it, the divergence measure "can be estimated from finite, unlabeled samples from the domains" [1].

The $\mathcal{H}$-divergence is finite-sample estimable from unlabelled data and can be approximated by training a domain classifier: if a model can easily distinguish source from target [features](/wiki/feature), the divergence is large. This estimator is the conceptual seed of every adversarial method below.
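One common way to turn the domain-classifier idea into a number is the proxy $\mathcal{A}$-distance, $\hat{d}_{\mathcal{A}} = 2(1 - 2\hat{\epsilon})$, where $\hat{\epsilon}$ is the held-out error of a classifier trained to separate source from target. The sketch below is a minimal illustration on one-dimensional features; the decision-stump classifier and the function names are assumptions made for brevity, not part of the cited papers.

```python
import random


def fit_stump(xs, domains):
    """Pick the threshold and orientation with the lowest training error."""
    best_err, best_t, best_sign = 1.0, 0.0, 1
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else 0 for x in xs]
            err = sum(p != d for p, d in zip(preds, domains)) / len(xs)
            if err < best_err:
                best_err, best_t, best_sign = err, t, sign
    return best_t, best_sign


def proxy_a_distance(source, target, seed=0):
    """2 * (1 - 2 * err), with err the held-out error of a source-vs-target classifier."""
    rng = random.Random(seed)
    data = [(x, 0) for x in source] + [(x, 1) for x in target]
    rng.shuffle(data)
    half = len(data) // 2
    train, held_out = data[:half], data[half:]
    t, sign = fit_stump([x for x, _ in train], [d for _, d in train])
    errors = sum((1 if sign * (x - t) > 0 else 0) != d for x, d in held_out)
    return 2.0 * (1.0 - 2.0 * errors / len(held_out))
```

Values near 2 mean the domains are almost perfectly separable; values near 0 mean the classifier does no better than chance. With real features a linear classifier or small network would replace the stump.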

Several other strands have grown from this foundation:

- **Importance weighting under covariate shift.** Sugiyama, Krauledat, and Müller (2007) proposed importance-weighted cross validation, and Sugiyama et al. (2008) introduced KLIEP (Kullback-Leibler Importance Estimation Procedure) [4]. Huang et al. (2007) introduced Kernel Mean Matching (KMM) [5]. KLIEP and KMM both estimate the ratio $w(x) = P_T(x) / P_S(x)$ directly, without first estimating the two densities separately.
- **PAC-Bayes for domain adaptation.** Germain, Habrard, Laviolette, and Morvant (2013) extended the PAC-Bayes framework to the adaptation setting [6].
- **Limits of unsupervised adaptation.** Ben-David and Urner (2014) showed that without strong assumptions, no algorithm can guarantee target accuracy from labelled source and unlabelled target alone [7]. Some structural assumption, typically the existence of a low-error joint hypothesis, is unavoidable.
- **Tighter bounds for label shift.** Tachet des Combes, Zhao, Wang, and Gordon (2020) proposed Generalised Label Shift assumptions and a corresponding correction that improves over plain feature alignment when label proportions also differ [8].

## What are the settings of domain adaptation?

The field uses a fine-grained vocabulary for the data and labels available at adaptation time. The three core regimes are supervised DA (a few labelled target examples exist), semi-supervised DA (a small labelled target set alongside a large unlabelled one), and unsupervised DA (no target labels at all).

| Setting | Source data | Target data | Labels | Defining paper |
|---|---|---|---|---|
| Unsupervised DA (UDA) | Labelled, available | [Unlabelled](/wiki/unsupervised), available | Source only | Ganin & Lempitsky 2015 |
| Semi-supervised DA | Labelled, available | Mostly unlabelled, few labelled | Source plus a few target | Saito et al. 2019 |
| Few-shot DA | Labelled, available | Few labelled examples per class | Source plus K target | Motiian et al. 2017 |
| Multi-source DA | Multiple labelled source domains | Unlabelled | Sources only | Peng et al. 2019 |
| Open-set DA | Labelled, with C classes | Unlabelled, with classes possibly outside C | Source only | Panareda Busto & Gall 2017 |
| Universal DA | Labelled, with possible private classes | Unlabelled, with possible private classes | Source only | You et al. 2019 |
| Source-free DA (SFDA) | Pre-trained model only, no source data | Unlabelled, available | Implicit in source model | Liang et al. 2020 (SHOT) |
| Test-time adaptation (TTA) | Pre-trained model only | Streaming target examples at inference | None | Wang et al. 2021 (TENT) |
| Domain generalisation (DG) | Multiple source domains | None during training | Sources only | Muandet, Balduzzi, Schölkopf 2013 |

Unsupervised DA is the dominant academic setting because it captures the essential difficulty (no target labels). Source-free DA reflects realistic deployment, where the source data may be proprietary, regulated, or too large to ship to the deployment site. Domain generalisation drops target access entirely. In practice these settings shade into one another.

## Methods

The methods landscape is large but organises into a small number of families.

| Family | Idea | Representative method | Year |
|---|---|---|---|
| Importance reweighting | Match source marginal to target marginal by weighting samples | KMM (Huang et al.); KLIEP (Sugiyama et al.) | 2007 / 2008 |
| Discrepancy-based feature alignment | Minimise a statistical distance (MMD, CORAL, CMD) between source and target features | DDC (Tzeng et al. 2014); DAN (Long et al. 2015); JAN (Long et al. 2017); Deep CORAL (Sun & Saenko 2016) | 2014 to 2017 |
| Adversarial alignment | Train a domain classifier and force features to fool it | DANN (Ganin & Lempitsky 2015); ADDA (Tzeng et al. 2017); CDAN (Long et al. 2018); MCD (Saito et al. 2018) | 2015 to 2018 |
| Generative / image-translation | Translate source images to target style with a [GAN](/wiki/generative_adversarial_network) | CyCADA (Hoffman et al. 2018) | 2018 |
| Optimal transport | Align joint feature-label distributions via Wasserstein distance | OTDA (Courty et al. 2016); JDOT (Courty et al. 2017); DeepJDOT (Damodaran et al. 2018) | 2016 to 2018 |
| Self-training / pseudo-labelling | Generate confident target pseudo-labels and retrain | SHOT (Liang et al. 2020); FixMatch-style hybrids | 2019 onward |
| Test-time adaptation | Adapt batch-norm or affine parameters at inference | TENT (Wang et al. 2021); MEMO (Zhang et al. 2022); T3A (Iwasawa & Matsuo 2021) | 2020 onward |
| Domain generalisation | Train on multiple sources to be robust to unseen target | DICA (Muandet et al. 2013); IRM (Arjovsky et al. 2019); GroupDRO (Sagawa et al. 2020); MixStyle (Zhou et al. 2021) | 2013 onward |

### Discrepancy-based feature alignment

Maximum Mean Discrepancy (MMD) is a kernel-based distance between distributions, measuring the difference between the mean embeddings of two samples in a Reproducing Kernel Hilbert Space; the canonical reference is Gretton et al.'s kernel two-sample test (JMLR 2012), which defines the MMD as "the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space" [9]. Long, Cao, Wang, and Jordan (2015) introduced **Deep Adaptation Networks (DAN)**, adding a multi-kernel MMD penalty across multiple task-specific layers of a CNN [10]. Their follow-up Joint Adaptation Networks (JAN, ICML 2017) align the joint distribution of multiple layers rather than marginals layer-by-layer [11].
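For concreteness, the squared MMD between two samples can be estimated from kernel averages alone. The following is a minimal sketch of the biased estimator with a single Gaussian kernel; the fixed bandwidth and the function names are illustrative assumptions, and DAN itself uses a multi-kernel variant inside a network.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when both samples are identical and grows as the samples separate. The double loop costs quadratic time in the sample size, which is fine for a batch but is one reason deep methods compute it per mini-batch.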

Sun and Saenko (2016) proposed **Deep CORAL** in an ECCV 2016 workshop paper [12]. CORAL aligns the second-order statistics of source and target features by minimising the Frobenius distance between covariance matrices; Deep CORAL extends this to a differentiable layer. The original CORAL method was branded "frustratingly easy" by Sun, Feng, and Saenko (AAAI 2016) because the closed-form alignment is a few lines of code, yet it matches more elaborate methods on several benchmarks [13]; Deep CORAL carries the same loss into deep networks.
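The covariance-matching idea is short enough to write out in full. The sketch below computes a CORAL-style loss, the squared Frobenius distance between source and target feature covariances scaled by $1/(4d^2)$ as in the Deep CORAL paper; it works on plain Python lists for illustration, whereas a real implementation would operate on batched tensors inside the training loop.

```python
def covariance(rows):
    """Unbiased sample covariance matrix of a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, divided by 4 * d^2."""
    d = len(source[0])
    cov_s, cov_t = covariance(source), covariance(target)
    sq_frobenius = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return sq_frobenius / (4.0 * d * d)
```

Because only second-order statistics are compared, two domains that differ in their means but share a covariance have zero CORAL loss, which is why the method is usually combined with a classification loss on the source.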

### How do adversarial domain adaptation methods work?

Adversarial alignment turns the $\mathcal{H}$-divergence into an explicit optimisation. Ganin and Lempitsky (2015) introduced **Domain-Adversarial Neural Networks (DANN)** at ICML, with a journal version by Ganin et al. (2016) in JMLR (vol. 17, pp. 1-35) [14][15]. The paper argues that "predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains" [15]. DANN attaches a domain classifier to the feature extractor through a **gradient reversal layer (GRL)** that, in the authors' words, "reverse[s] (or multipl[ies] by -1) the gradient that comes back through it" during backpropagation [15]. The feature extractor is thereby trained to maximise the domain classifier's loss while minimising the label loss, producing features that are discriminative for the source task but indistinguishable across domains. Architecturally, the method adds only the domain classifier head and the GRL to a standard feed-forward network.
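Written out, the DANN training objective is a saddle-point problem. The formulation below follows the structure of the JMLR paper, with feature extractor parameters $\theta_f$, label predictor parameters $\theta_y$, and domain classifier parameters $\theta_d$; the trade-off weight is written here as $\alpha$ (an assumed symbol, to avoid clashing with the $\lambda$ of the Ben-David bound), and $n$ and $n'$ denote the numbers of source and target examples.

$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n} \sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \alpha \left( \frac{1}{n} \sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'} \sum_{i=n+1}^{n+n'} L_d^i(\theta_f, \theta_d) \right)$$

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)$$

The gradient reversal layer $R$ makes this trainable with ordinary backpropagation: it is the identity in the forward pass, $R(x) = x$, and flips the sign of the gradient in the backward pass, $\partial R / \partial x = -\alpha I$, so a single descent step on the combined loss moves $\theta_d$ to reduce the domain loss and $\theta_f$ to increase it.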

Tzeng, Hoffman, Saenko, and Darrell (2017) generalised this idea in **Adversarial Discriminative Domain Adaptation (ADDA)** at CVPR [16]. ADDA uses a two-stage scheme: a source encoder and classifier are trained first, then a separate target encoder with untied weights is trained against a domain discriminator using a standard GAN-style adversarial loss. The authors report that this is simpler than competing adversarial approaches and more effective on cross-modality tasks such as adapting from RGB to depth.

Long, Cao, Wang, and Jordan (2018) introduced **Conditional Domain Adversarial Networks (CDAN)** at NeurIPS, which condition the domain discriminator on the cross-covariance between feature representations and classifier predictions [17]. CDAN aligns the joint distribution rather than just the marginal, which helps when the data are multimodal in the class sense.

Saito, Watanabe, Ushiku, and Harada (2018) proposed **Maximum Classifier Discrepancy (MCD)** at CVPR [18]. MCD trains two task classifiers on top of a shared feature extractor and alternates between maximising and minimising their disagreement on target samples, using the discrepancy as a proxy for whether target features lie inside the source classifier's well-supported region.
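The disagreement that MCD maximises and minimises is simply the mean absolute difference between the two classifiers' softmax outputs on target samples. A minimal sketch of that quantity follows; the function names and the list-of-logits input format are assumptions for illustration, and the alternating training loop itself is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Batch mean of (1/K) * sum_k |p_a[k] - p_b[k]| for two classifier heads."""
    total = 0.0
    for z_a, z_b in zip(logits_a, logits_b):
        p_a, p_b = softmax(z_a), softmax(z_b)
        total += sum(abs(a - b) for a, b in zip(p_a, p_b)) / len(p_a)
    return total / len(logits_a)
```

In the full method the two classifier heads are updated to increase this value on target data while staying accurate on the source, and the shared feature extractor is then updated to decrease it.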

### Generative, optimal-transport, and self-training methods

Hoffman et al. (2018) introduced **CyCADA (Cycle-Consistent Adversarial Domain Adaptation)** at ICML, which adapts at both the pixel level and the feature level, building on [CycleGAN](/wiki/cyclegan) to translate source images into target style with a semantic consistency loss [19]. The approach was particularly effective on synthetic-to-real urban driving (GTA to Cityscapes).

Courty, Flamary, Tuia, and Rakotomamonjy introduced **OTDA** in TPAMI 2016, transporting source samples to the target via an entropy-regularised Wasserstein problem [20]. Courty, Flamary, Habrard, and Rakotomamonjy followed with **JDOT (Joint Distribution Optimal Transport)** at NeurIPS 2017, which aligns joint feature-label distributions and learns the prediction function and transport plan together [21]. DeepJDOT (Damodaran et al. 2018) extended this to deep networks [22].
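Entropy-regularised transport problems of this kind are commonly solved with Sinkhorn iterations, which alternately rescale the rows and columns of a kernel matrix until both marginals match. The sketch below is a small dependency-free illustration on one-dimensional points; the regularisation strength, iteration count, and function names are assumptions, and production code would typically use a dedicated optimal-transport library with log-domain stabilisation.

```python
import math


def sinkhorn(a, b, cost, reg=1.0, iters=500):
    """Entropy-regularised transport plan between histograms a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_map(plan, target_points):
    """Send each source point to the plan-weighted average of the target points."""
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append(sum(w * t for w, t in zip(row, target_points)) / mass)
    return mapped
```

A typical use is to build `cost` from squared distances between source and target features, compute the plan with uniform weights, move the source samples with `barycentric_map`, and train a classifier on the moved samples with their original labels. Very small `reg` values can underflow in this naive form.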

Self-training adapts a model by using its confident predictions on target data as pseudo-labels for further training. Liang, Hu, and Feng (2020) introduced **SHOT (Source HypOthesis Transfer)** at ICML, which freezes the source classifier and learns a new feature extractor for the target by combining information maximisation with self-supervised pseudo-labelling [23]. SHOT does not need source data at adaptation time, which matters when source data is private or proprietary.
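The two ingredients named above can be sketched in a few lines. The information-maximisation loss rewards predictions that are individually confident (low per-sample entropy) and collectively diverse (high entropy of the mean prediction). The pseudo-labelling helper uses a plain confidence threshold, which is a generic self-training simplification and an assumption of this sketch; SHOT itself derives pseudo-labels from class centroids in feature space.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def information_maximisation_loss(batch_probs):
    """Mean per-sample entropy minus the entropy of the batch-mean prediction."""
    n, k = len(batch_probs), len(batch_probs[0])
    mean_entropy = sum(entropy(p) for p in batch_probs) / n
    mean_pred = [sum(p[j] for p in batch_probs) / n for j in range(k)]
    return mean_entropy - entropy(mean_pred)


def confident_pseudo_labels(batch_probs, threshold=0.9):
    """Return (index, predicted class) for samples at or above the threshold."""
    return [
        (i, max(range(len(p)), key=p.__getitem__))
        for i, p in enumerate(batch_probs)
        if max(p) >= threshold
    ]
```

The loss is lowest when every sample is assigned confidently and the classes are used evenly; predicting one class for everything gives zero per-sample entropy but also zero diversity, so it is not rewarded.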

### Test-time adaptation

Wang, Shelhamer, Liu, Olshausen, and Darrell (2021) introduced **TENT (Test-time Entropy Minimisation)** at ICLR 2021 [24]. TENT updates only the channel-wise affine parameters of [batch-normalisation](/wiki/batch_normalization) layers at test time, by minimising the entropy of the model's predictions on the test batch. The method needs only the trained model and the unlabelled test data, with no access to source data. The authors report that it "reaches a new state-of-the-art error on ImageNet-C", the corruption benchmark, while also improving CIFAR-10/100-C, segmentation, and digit-recognition adaptation [24]. Subsequent work in this line includes MEMO (Zhang, Levine & Finn 2022) and T3A (Iwasawa & Matsuo 2021), which adjusts only the classifier's prototypes at test time [25][26].
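The mechanics of entropy minimisation over affine parameters can be shown on a deliberately tiny model. The sketch below is a toy under strong assumptions, not the TENT implementation: it uses a single scalar feature, a fixed two-class head with logits $[z, -z]$, and finite-difference gradients in place of backpropagation. Only the scale `gamma` and shift `beta` are updated; the normalisation statistics come from the test batch itself.

```python
import math


def batch_norm(xs, eps=1e-5):
    """Standardise a batch using its own mean and variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def mean_entropy(xs_hat, gamma, beta):
    """Mean prediction entropy for a fixed two-class head with logits [z, -z]."""
    total = 0.0
    for x in xs_hat:
        z = gamma * x + beta
        p = 1.0 / (1.0 + math.exp(-2.0 * z))  # softmax probability of class 0
        for q in (p, 1.0 - p):
            if q > 0.0:
                total -= q * math.log(q)
    return total / len(xs_hat)


def tent_style_adapt(xs, gamma=1.0, beta=0.0, lr=0.5, steps=20, h=1e-4):
    """Gradient descent on entropy with respect to the affine parameters only."""
    xs_hat = batch_norm(xs)
    for _ in range(steps):
        # Central finite differences stand in for backpropagation (assumption).
        grad_gamma = (
            mean_entropy(xs_hat, gamma + h, beta) - mean_entropy(xs_hat, gamma - h, beta)
        ) / (2.0 * h)
        grad_beta = (
            mean_entropy(xs_hat, gamma, beta + h) - mean_entropy(xs_hat, gamma, beta - h)
        ) / (2.0 * h)
        gamma -= lr * grad_gamma
        beta -= lr * grad_beta
    return gamma, beta
```

Even this toy shows the characteristic behaviour: the scale grows so that predictions become more confident. It also hints at the known risk of entropy minimisation, since nothing here prevents the shift from pushing every sample toward one class.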

### Domain generalisation methods

Domain generalisation removes target access entirely. Muandet, Balduzzi, and Schölkopf (2013) introduced **Domain-Invariant Component Analysis (DICA)**, a kernel method that learns a transformation minimising cross-domain dissimilarity [27]. Modern DG methods include Invariant Risk Minimisation (IRM, Arjovsky et al. 2019), Group Distributionally Robust Optimisation (GroupDRO, Sagawa et al. 2020), and style-mixing methods such as MixStyle (Zhou et al. 2021) [28][29][30].

A notable empirical finding came from Gulrajani and Lopez-Paz (2021) at ICLR. Their **DomainBed** benchmark unified seven multi-domain datasets, nine baseline algorithms, and three model selection criteria [31]. They reported that, when carefully implemented and tuned with the same hyperparameter budget, plain [Empirical Risk Minimisation (ERM)](/wiki/empirical_risk_minimization) was competitive with all proposed DG methods: in their words, "no algorithm included in DomainBed outperforms ERM by more than one point when evaluated under the same experimental conditions" [31]. This has reshaped how the community evaluates DG.

## What are the standard domain adaptation benchmarks?

Progress in the field is measured on a small set of canonical datasets.

| Benchmark | Year | Domains | Classes | Notes |
|---|---|---|---|---|
| Office-31 | 2010 | 3 (Amazon, Webcam, DSLR) | 31 | Saenko, Kulis, Fritz & Darrell at ECCV; 4,110 office-object images (Amazon 2,817, DSLR 498, Webcam 795); standard small-scale UDA |
| Office-Caltech-10 | 2012 | 4 (Amazon, Webcam, DSLR, Caltech) | 10 | Gong et al. extension of Office-31 |
| Office-Home | 2017 | 4 (Art, Clipart, Product, Real-World) | 65 | Venkateswara et al.; 15,588 images; medium-scale |
| VisDA-2017 | 2017 | 2 (synthetic, real) | 12 | Peng et al. challenge dataset; large synthetic-to-real |
| DomainNet | 2019 | 6 (Clipart, Infograph, Painting, Quickdraw, Real, Sketch) | 345 | Peng et al. at ICCV; about 0.6 million images; largest visual DA benchmark |
| PACS | 2017 | 4 (Photo, Art painting, Cartoon, Sketch) | 7 | Li, Yang, Song & Hospedales at ICCV; 9,991 images; main DG benchmark |
| DomainBed | 2021 | varies (PACS, OfficeHome, VLCS, TerraIncognita, etc.) | varies | Gulrajani & Lopez-Paz unified DG testbed (7 datasets, 9 algorithms) |
| WILDS | 2021 | 10 datasets in real domains | varies | Koh et al. at ICML; tumour images, wildlife camera traps, satellite imagery, code, text |

Office-31 contains 31 categories drawn from three domains: 2,817 images from Amazon listings, 795 low-resolution webcam shots, and 498 high-resolution DSLR photographs, for 4,110 images in total (a figure of 4,652 is also quoted in parts of the literature, but it does not match these per-domain counts) [32]. DomainNet is the largest visual DA benchmark, with roughly 0.6 million images across 345 categories and six domains [33].

WILDS aims explicitly at real-world shifts. It bundles 10 datasets, each with a natural domain split: different hospitals (Camelyon17 tumour identification), different camera traps (iWildCam), different times and locations (FMoW satellite imagery), or different countries (PovertyMap) [34]. The authors report that "standard training yields substantially lower out-of-distribution than in-distribution performance on each dataset", a gap that "remains even with models trained by existing methods for tackling distribution shifts" [34].

## What is domain adaptation used for?

Domain adaptation shows up wherever deployment differs from training.

- **Medical imaging.** Models trained at one hospital systematically underperform at others because of scanner brand, acquisition protocol, demographics, and disease prevalence. Adaptation methods are routinely benchmarked on tumour classification (Camelyon17), chest X-rays, and retinal imaging.
- **Autonomous driving and sim-to-real.** Driving policies and perception modules are typically trained in simulation (CARLA, GTA, AirSim) and adapted to real video. Tobin, Fong, Ray, Schneider, Zaremba, and Abbeel (2017) introduced **domain randomisation** at IROS, which randomises non-essential simulator properties (lighting, textures, distractors) so that the real world looks like just another simulated variant; they trained a real-world object detector accurate to 1.5 cm using only simulated images with random non-realistic textures [35]. It is now one of the dominant [sim-to-real](/wiki/sim_to_real_transfer) techniques in robotics.
- **Robotics manipulation.** Grasp models trained in simulation are adapted to physical robots through randomisation, image translation, or progressive networks.
- **Cross-domain NLP.** Sentiment analysis from movie reviews to product reviews; biomedical NER; cross-lingual transfer for low-resource languages.
- **Speech recognition.** Cross-accent, cross-microphone, and cross-language [ASR](/wiki/speech_recognition) adaptation. Modern ASR combines self-supervised pretraining ([wav2vec 2.0](/wiki/wav2vec), [HuBERT](/wiki/hubert)) with task-specific [fine-tuning](/wiki/fine_tuning).
- **Recommender systems.** Cross-domain recommendation, where interactions in a source domain inform predictions in a target domain for the same users.
- **Industrial fault diagnosis and remote sensing.** Adaptation across machines, satellite sensors, geographic regions, and time of acquisition.

Domain adaptation also plays a role in fairness work: subpopulation shift formalises the accuracy gap between majority and minority groups, and methods such as GroupDRO can be cast as DA between subpopulations.

## Connection to broader concepts

- **Transfer learning** is the umbrella concept; DA is the case where source and target tasks are the same but input distributions differ. Pan and Yang (2010) is the canonical survey [36].
- **Multi-task learning** trains on multiple tasks at once, which can implicitly help DA.
- **Continual / lifelong learning** addresses sequential domain shifts, where the model must adapt without forgetting.
- **Self-supervised pre-training.** Models such as [CLIP](/wiki/clip), DINO, and DINOv2 produce representations that often transfer well to new domains. Empirical work since 2021 has shown that these foundation models often beat purpose-built DA methods on standard benchmarks because their source distribution is broad.
- **Foundation models and prompt-based adaptation.** Few-shot prompting and parameter-efficient adaptation (LoRA, prefix tuning, adapters) are forms of DA that exploit a strong pre-trained representation.
- **Robustness to distribution shift.** A nearby literature studies natural and adversarial robustness, including the Recht et al. study and work on [adversarial attacks](/wiki/adversarial_attack) and corruption robustness (ImageNet-C).

## Practical considerations

Beyond the algorithm zoo, there are several recurring practical issues that often matter more than the choice of method.

- **Hyperparameter selection without target labels.** With no labelled target data, standard cross-validation cannot be used. Workarounds include source validation, importance-weighted validation, and a small held-out target set, but Gulrajani and Lopez-Paz showed that this choice can swing reported accuracies by several points [31].
- **Unfair comparisons.** Papers historically used different backbones, augmentation, and training budgets, so apparent improvements on Office-31 sometimes reflected those rather than the proposed method. DomainBed and WILDS were designed to fix this [31][34].
- **Negative transfer.** When source and target are too different, DA can hurt rather than help. The $\lambda$ term in the Ben-David bound formalises this: if no single hypothesis achieves low error on both domains, $\lambda$ is large and aligning the input distributions gives no guarantee of low target error [1].
- **Data efficiency vs collecting target labels.** Few-shot adaptation typically beats elaborate UDA once a few hundred target labels are available, especially with a strong pre-trained backbone.
- **Foundation-model fine-tuning often beats classical DA.** A 2023-2024 line of work has reported that supervised fine-tuning of a CLIP backbone on the source alone can match or exceed the best UDA methods on Office-Home and DomainNet, without using any target data.
- **Batch-norm statistics matter.** Many gains in test-time adaptation come from re-estimating batch-norm statistics on target data alone, which is essentially free.
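The batch-norm point in the list above is easy to see in isolation. The sketch below uses made-up activation values and assumed helper names: it normalises target activations once with stale source statistics and once with statistics re-estimated on the target batch.

```python
import math


def mean_and_std(values):
    """Population mean and standard deviation of a list of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalise(values, mean, std, eps=1e-5):
    """Batch-norm style standardisation with the given statistics."""
    return [(v - mean) / (std + eps) for v in values]


# Hypothetical activations: the target is a scaled and shifted copy of the source.
source_acts = [0.0, 1.0, 2.0, 3.0, 4.0]
target_acts = [2.0 * v + 5.0 for v in source_acts]

stale = normalise(target_acts, *mean_and_std(source_acts))  # source statistics
fresh = normalise(target_acts, *mean_and_std(target_acts))  # re-estimated on target
```

With source statistics the target activations reach the next layer far from zero mean and unit variance; with re-estimated statistics they are standardised again. No gradient step or label is involved, which is why this adjustment is usually the first thing to try.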

## Open challenges

Several problems are active research areas as of 2026.

- **Theoretical guarantees with deep networks.** The classical bounds assume a fixed hypothesis class; deep models effectively learn the hypothesis class, which complicates the divergence terms.
- **Source-free and federated DA.** As privacy and data-residency rules tighten, shipping source data to deployment sites is often impossible. Source-free methods such as SHOT and federated DA address this.
- **Test-time adaptation in non-stationary streams.** TENT and successors assume i.i.d. test batches; real deployment streams may be temporally correlated, drift over time, or include adversarial or out-of-distribution inputs.
- **Multi-modal DA.** Adapting models that combine vision, language, and other modalities introduces new shifts, since each modality may move independently.
- **Foundation-model robustness.** Quantifying when and why a CLIP-style backbone generalises across domains, and where it still fails (medical imaging, low-resource languages, specialised industrial sensors).
- **Fairness during adaptation.** Adaptation methods can preserve or exacerbate disparities across subpopulations.
- **Continual / lifelong DA.** Making models that adapt to new domains without forgetting old ones and without access to old source data.

The classical UDA setting with Office-31 and CNNs trained from scratch is now mostly a teaching example. The current frontier is about source-free and test-time methods, benchmarks like WILDS that resemble real deployment, and how foundation models change the meaningful baselines.

## References

1. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. *Machine Learning*, 79(1-2), 151-175. https://www.alexkulesza.com/pubs/adapt_mlj10.pdf
2. Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? *ICML*. arXiv:1902.10811. https://arxiv.org/abs/1902.10811
3. Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of Representations for Domain Adaptation. *NeurIPS*. https://papers.nips.cc/paper/2006/hash/b1b0432ceafb0ce714426e9114852ac7-Abstract.html
4. Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate Shift Adaptation by Importance Weighted Cross Validation. *JMLR*, 8, 985-1005. https://www.jmlr.org/papers/v8/sugiyama07a.html
5. Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2007). Correcting Sample Selection Bias by Unlabeled Data. *NeurIPS*. https://papers.nips.cc/paper/2006/hash/a2186aa7c086b46ad4e8bf81e2a3a19b-Abstract.html
6. Germain, P., Habrard, A., Laviolette, F., & Morvant, E. (2013). A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers. *ICML*. https://proceedings.mlr.press/v28/germain13.html
7. Ben-David, S., & Urner, R. (2014). Domain Adaptation: Can Quantity Compensate for Quality? *Annals of Mathematics and Artificial Intelligence*. https://link.springer.com/article/10.1007/s10472-013-9371-9
8. Tachet des Combes, R., Zhao, H., Wang, Y.-X., & Gordon, G. J. (2020). Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift. *NeurIPS*. arXiv:2003.04475. https://arxiv.org/abs/2003.04475
9. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. *JMLR*, 13, 723-773. https://www.jmlr.org/papers/v13/gretton12a.html
10. Long, M., Cao, Y., Wang, J., & Jordan, M. I. (2015). Learning Transferable Features with Deep Adaptation Networks. *ICML*. arXiv:1502.02791. https://arxiv.org/abs/1502.02791
11. Long, M., Zhu, H., Wang, J., & Jordan, M. I. (2017). Deep Transfer Learning with Joint Adaptation Networks. *ICML*. arXiv:1605.06636. https://arxiv.org/abs/1605.06636
12. Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. *ECCV Workshops*. arXiv:1607.01719. https://arxiv.org/abs/1607.01719
13. Sun, B., Feng, J., & Saenko, K. (2016). Return of Frustratingly Easy Domain Adaptation. *AAAI*. arXiv:1511.05547. https://arxiv.org/abs/1511.05547
14. Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. *ICML*. arXiv:1409.7495. https://arxiv.org/abs/1409.7495
15. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-Adversarial Training of Neural Networks. *JMLR*, 17(59), 1-35. https://jmlr.org/papers/v17/15-239.html
16. Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial Discriminative Domain Adaptation (ADDA). *CVPR*. arXiv:1702.05464. https://arxiv.org/abs/1702.05464
17. Long, M., Cao, Z., Wang, J., & Jordan, M. I. (2018). Conditional Adversarial Domain Adaptation. *NeurIPS*. arXiv:1705.10667. https://arxiv.org/abs/1705.10667
18. Saito, K., Watanabe, K., Ushiku, Y., & Harada, T. (2018). Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. *CVPR*. arXiv:1712.02560. https://arxiv.org/abs/1712.02560
19. Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). CyCADA: Cycle-Consistent Adversarial Domain Adaptation. *ICML*. arXiv:1711.03213. https://arxiv.org/abs/1711.03213
20. Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2016). Optimal Transport for Domain Adaptation. *IEEE TPAMI*. arXiv:1507.00504. https://arxiv.org/abs/1507.00504
21. Courty, N., Flamary, R., Habrard, A., & Rakotomamonjy, A. (2017). Joint Distribution Optimal Transportation for Domain Adaptation. *NeurIPS*. arXiv:1705.08848. https://arxiv.org/abs/1705.08848
22. Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. *ECCV*. arXiv:1803.10081. https://arxiv.org/abs/1803.10081
23. Liang, J., Hu, D., & Feng, J. (2020). Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation (SHOT). *ICML*. arXiv:2002.08546. https://arxiv.org/abs/2002.08546
24. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2021). Tent: Fully Test-Time Adaptation by Entropy Minimization. *ICLR*. arXiv:2006.10726. https://arxiv.org/abs/2006.10726
25. Zhang, M., Levine, S., & Finn, C. (2022). MEMO: Test Time Robustness via Adaptation and Augmentation. *NeurIPS*. arXiv:2110.09506. https://arxiv.org/abs/2110.09506
26. Iwasawa, Y., & Matsuo, Y. (2021). Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization (T3A). *NeurIPS*. https://proceedings.neurips.cc/paper/2021/hash/1415fe9fea0f8d63bb1afeacedf6dad9-Abstract.html
27. Muandet, K., Balduzzi, D., & Schölkopf, B. (2013). Domain Generalization via Invariant Feature Representation. *ICML*. arXiv:1301.2115. https://arxiv.org/abs/1301.2115
28. Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant Risk Minimization. arXiv:1907.02893. https://arxiv.org/abs/1907.02893
29. Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2020). Distributionally Robust Neural Networks for Group Shifts (GroupDRO). *ICLR*. arXiv:1911.08731. https://arxiv.org/abs/1911.08731
30. Zhou, K., Yang, Y., Qiao, Y., & Xiang, T. (2021). Domain Generalization with MixStyle. *ICLR*. arXiv:2104.02008. https://arxiv.org/abs/2104.02008
31. Gulrajani, I., & Lopez-Paz, D. (2021). In Search of Lost Domain Generalization. *ICLR*. arXiv:2007.01434. https://arxiv.org/abs/2007.01434
32. Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting Visual Category Models to New Domains. *ECCV*. https://link.springer.com/chapter/10.1007/978-3-642-15561-1_16
33. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., & Wang, B. (2019). Moment Matching for Multi-Source Domain Adaptation. *ICCV*. arXiv:1812.01754. https://arxiv.org/abs/1812.01754
34. Koh, P. W., et al. (2021). WILDS: A Benchmark of in-the-Wild Distribution Shifts. *ICML*. arXiv:2012.07421. https://arxiv.org/abs/2012.07421
35. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. *IROS*. arXiv:1703.06907. https://arxiv.org/abs/1703.06907
36. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. *IEEE TKDE*, 22(10), 1345-1359. https://ieeexplore.ieee.org/document/5288526
37. Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep Hashing Network for Unsupervised Domain Adaptation (Office-Home dataset). *CVPR*. arXiv:1706.07522. https://arxiv.org/abs/1706.07522
38. Li, D., Yang, Y., Song, Y.-Z., & Hospedales, T. M. (2017). Deeper, Broader and Artier Domain Generalization (PACS dataset). *ICCV*. arXiv:1710.03077. https://arxiv.org/abs/1710.03077
39. Csurka, G. (2017). Domain Adaptation for Visual Applications: A Comprehensive Survey. arXiv:1702.05374; in *Domain Adaptation in Computer Vision Applications*, Springer. https://arxiv.org/abs/1702.05374
40. Wang, M., & Deng, W. (2018). Deep Visual Domain Adaptation: A Survey. *Neurocomputing*, 312, 135-153. arXiv:1802.03601. https://arxiv.org/abs/1802.03601
41. You, K., Long, M., Cao, Z., Wang, J., & Jordan, M. I. (2019). Universal Domain Adaptation. *CVPR*. https://openaccess.thecvf.com/content_CVPR_2019/html/You_Universal_Domain_Adaptation_CVPR_2019_paper.html
42. Panareda Busto, P., & Gall, J. (2017). Open Set Domain Adaptation. *ICCV*. https://openaccess.thecvf.com/content_iccv_2017/html/Busto_Open_Set_Domain_ICCV_2017_paper.html