Domain adaptation

Domain adaptation is the subfield of transfer learning that addresses distribution shift between a source domain, where labelled training data is plentiful, and a target domain, where labels are scarce or absent. The goal is to train a model that performs well on the target distribution despite the shift. Standard supervised machine learning assumes that training and test data are drawn independently and identically from the same distribution; domain adaptation relaxes this i.i.d. assumption and tries to characterise, bound, and reduce the loss in target performance that follows.

The field has been shaped by a foundational generalisation bound from Ben-David et al. (2010), a wave of deep adversarial and discrepancy-based methods between 2015 and 2018, and a more recent shift toward source-free, test-time, and foundation-model-era robustness.

Why domain adaptation matters

Real-world deployment rarely matches the training distribution. A vision model trained on stock photography may be evaluated on smartphone images. A medical imaging model trained at one hospital may be deployed at another with a different scanner brand and patient demographic. A driving model trained in California will not see the same lighting or signage in Berlin. Failure to adapt produces silent accuracy drops that are hard to detect post-deployment.

This is not a theoretical concern. Recht, Roelofs, Schmidt, and Shankar (2019) carefully reproduced the ImageNet test-set construction process and found that a wide range of state-of-the-art classifiers lost 11 to 14 percentage points of accuracy on the new test set, even though the new images were drawn from the same Flickr pool by the same protocol. The minutiae of dataset collection can produce a measurable distribution shift that hurts every model.

Foundation-model fine-tuning is implicitly a domain adaptation problem: a pre-trained model encodes a broad source distribution and is adapted to a narrower target task with much less data, often using parameter-efficient methods such as LoRA.

Types of distribution shift

Domain adaptation literature distinguishes several flavours of shift between $P_S(X, Y)$ and $P_T(X, Y)$, each with different identifiability and different algorithmic remedies.

Shift type	Decomposition	What changes	Canonical example
Covariate shift	$P(Y \mid X)$ preserved, $P(X)$ differs	Inputs are sampled differently	Different camera or sensor; different population sampling
Label / prior shift	$P(X \mid Y)$ preserved, $P(Y)$ differs	Class frequencies change	Disease prevalence differs between hospitals
Concept shift / drift	$P(Y \mid X)$ differs	Labelling rule changes	User preferences shift over time; sentiment classes redefined
Subpopulation shift	Target is a sub-distribution of source	Mass concentrates on a slice	Deploy a model trained on all users only on a minority group
Domain shift	General; both marginals and conditionals may move	Different data-generating processes	Photograph vs sketch; English vs Spanish text

Covariate shift is the most theoretically tractable case because the target risk can be estimated without bias by importance-weighting the source loss, provided the density ratio $P_T(x) / P_S(x)$ is known and the source support covers the target. Concept shift is the hardest: without target labels there is no information about how $P(Y \mid X)$ has moved. Deep-learning DA usually addresses domain shift, where multiple components change simultaneously.
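As a rough illustration of importance weighting under covariate shift, the sketch below assumes that both input densities are known one-dimensional Gaussians, which is an assumption made purely for the example (in practice the ratio has to be estimated). The function names and the toy loss $\ell(x) = x^2$ are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_weighted_risk(losses, x, src, tgt):
    """Estimate the target risk from source samples using w(x) = p_T(x) / p_S(x)."""
    w = gaussian_pdf(x, *tgt) / gaussian_pdf(x, *src)
    return float(np.mean(w * losses))

rng = np.random.default_rng(0)
src, tgt = (0.0, 1.0), (1.0, 1.0)       # (mean, std) of P_S(x) and P_T(x), assumed known
x_src = rng.normal(*src, size=200_000)  # only source inputs are sampled
losses = x_src ** 2                     # toy per-sample loss l(x) = x^2

naive = float(np.mean(losses))                                 # estimates E_S[l(x)] = 1
weighted = importance_weighted_risk(losses, x_src, src, tgt)   # estimates E_T[l(x)] = 2
```

The unweighted average estimates the source risk, while the weighted average approaches the target risk even though no target sample was drawn. The variance of the weighted estimate grows quickly as the two densities move apart, which is one reason the approach is mostly used for mild shifts.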

Theoretical foundations

The theoretical core is the generalisation bound of Ben-David et al., introduced in Analysis of Representations for Domain Adaptation (NeurIPS 2007) and extended in A theory of learning from different domains (Machine Learning, 2010). For a hypothesis class $\mathcal{H}$ and a hypothesis $h$, the target error is bounded by

$$\varepsilon_T(h) \le \varepsilon_S(h) + \frac{1}{2} d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T) + \lambda$$

where $\varepsilon_S(h)$ is the source error, $d_{\mathcal{H} \triangle \mathcal{H}}$ is the $\mathcal{H} \triangle \mathcal{H}$-divergence (the $\mathcal{H}$-divergence taken over the symmetric-difference hypothesis class) between the input marginals, and $\lambda$ is the combined source-plus-target error of the best single hypothesis in $\mathcal{H}$. The bound is intuitive: a model performs well on the target if it performs well on the source, the input distributions are not too different from the hypothesis class's perspective, and at least one good hypothesis exists for both domains.
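For reference, the two domain-dependent terms are usually written as follows in the 2010 paper; $h'$ denotes a second hypothesis from the same class and is notation introduced here.

$$d_{\mathcal{H} \triangle \mathcal{H}}(D_S, D_T) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim D_S}[h(x) \ne h'(x)] - \Pr_{x \sim D_T}[h(x) \ne h'(x)] \right|$$

$$\lambda = \min_{h \in \mathcal{H}} \left[ \varepsilon_S(h) + \varepsilon_T(h) \right]$$

The divergence depends only on unlabelled inputs, whereas $\lambda$ depends on labels from both domains and therefore cannot be estimated in the purely unsupervised setting.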

The $\mathcal{H}$-divergence is finite-sample estimable from unlabelled data and can be approximated by training a domain classifier: if a model can easily distinguish source from target features, the divergence is large. This estimator is the conceptual seed of every adversarial method below.
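The domain-classifier estimate can be sketched as below. The quantity $2(1 - 2\,\mathrm{err})$, often called the proxy A-distance, is computed from the held-out error of a classifier trained to separate source from target. The logistic-regression classifier, the 50/50 split, and all function names are assumptions of this sketch rather than a prescribed procedure.

```python
import numpy as np

def fit_logistic(x, y, steps=500, lr=0.5):
    """Fit a logistic-regression domain classifier by batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def proxy_a_distance(src, tgt, seed=0):
    """Return 2 * (1 - 2 * err), where err is the held-out domain-classification error."""
    rng = np.random.default_rng(seed)
    x = np.vstack([src, tgt])
    y = np.concatenate([np.zeros(len(src)), np.ones(len(tgt))])  # 0 = source, 1 = target
    order = rng.permutation(len(y))
    x, y = x[order], y[order]
    half = len(y) // 2                      # first half trains, second half evaluates
    w, b = fit_logistic(x[:half], y[:half])
    pred = (x[half:] @ w + b) > 0
    err = float(np.mean(pred != y[half:]))
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(1)
same = proxy_a_distance(rng.normal(0, 1, (500, 2)), rng.normal(0, 1, (500, 2)))
shifted = proxy_a_distance(rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2)))
```

When the two samples come from the same distribution the classifier is near chance and the value is close to 0; when they are easy to separate the value approaches 2.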

Several other strands have grown from this foundation:

Importance weighting under covariate shift. Sugiyama, Krauledat, and Müller (2007) proposed importance-weighted cross validation, and Sugiyama et al. (2008) introduced KLIEP (Kullback-Leibler Importance Estimation Procedure). Huang et al. (2007) introduced Kernel Mean Matching (KMM). Both estimate $w(x) = P_T(x) / P_S(x)$ directly.
PAC-Bayes for domain adaptation. Germain, Habrard, Laviolette, and Morvant (2013) extended the PAC-Bayes framework to the adaptation setting.
Limits of unsupervised adaptation. Ben-David and Urner (2014) showed that without strong assumptions, no algorithm can guarantee target accuracy from labelled source and unlabelled target alone. Some structural assumption, typically the existence of a low-error joint hypothesis, is unavoidable.
Tighter bounds for label shift. Tachet des Combes, Zhao, Wang, and Gordon (2020) proposed Generalised Label Shift assumptions and a corresponding correction that improves over plain feature alignment when label proportions also differ.

Setting variants

The field uses a fairly fine-grained vocabulary for the precise data and label availability at adaptation time.

Setting	Source data	Target data	Labels	Defining paper
Unsupervised DA (UDA)	Labelled, available	Unlabelled, available	Source only	Ganin & Lempitsky 2015
Semi-supervised DA	Labelled, available	Mostly unlabelled, few labelled	Source plus a few target	Saito et al. 2019
Few-shot DA	Labelled, available	Few labelled examples per class	Source plus K target	Motiian et al. 2017
Multi-source DA	Multiple labelled source domains	Unlabelled	Sources only	Peng et al. 2019
Open-set DA	Labelled, with C classes	Unlabelled, with classes possibly outside C	Source only	Panareda Busto & Gall 2017
Universal DA	Labelled, with possible private classes	Unlabelled, with possible private classes	Source only	You et al. 2019
Source-free DA (SFDA)	Pre-trained model only, no source data	Unlabelled, available	Implicit in source model	Liang et al. 2020 (SHOT)
Test-time adaptation (TTA)	Pre-trained model only	Streaming target examples at inference	None	Wang et al. 2021 (TENT)
Domain generalisation (DG)	Multiple source domains	None during training	Sources only	Muandet, Balduzzi, Schölkopf 2013

Unsupervised DA is the dominant academic setting because it captures the essential difficulty (no target labels). Source-free DA reflects realistic deployment, where the source data may be proprietary, regulated, or too large to ship to the deployment site. Domain generalisation drops target access entirely. In practice these settings shade into one another.

Methods

The methods landscape is large but organises into a small number of families.

Family	Idea	Representative method	Year
Importance reweighting	Match source marginal to target marginal by weighting samples	KMM (Huang et al.); KLIEP (Sugiyama et al.)	2007 / 2008
Discrepancy-based feature alignment	Minimise a statistical distance (MMD, CORAL, CMD) between source and target features	DDC (Tzeng et al. 2014); DAN (Long et al. 2015); JAN (Long et al. 2017); Deep CORAL (Sun & Saenko 2016)	2014 to 2017
Adversarial alignment	Train a domain classifier and force features to fool it	DANN (Ganin & Lempitsky 2015); ADDA (Tzeng et al. 2017); CDAN (Long et al. 2018); MCD (Saito et al. 2018)	2015 to 2018
Generative / image-translation	Translate source images to target style with a GAN	CyCADA (Hoffman et al. 2018)	2018
Optimal transport	Align joint feature-label distributions via Wasserstein distance	OTDA (Courty et al. 2016); JDOT (Courty et al. 2017); DeepJDOT (Damodaran et al. 2018)	2016 to 2018
Self-training / pseudo-labelling	Generate confident target pseudo-labels and retrain	SHOT (Liang et al. 2020); FixMatch-style hybrids	2019 onward
Test-time adaptation	Adapt batch-norm or affine parameters at inference	TENT (Wang et al. 2021); MEMO (Zhang et al. 2022); T3A (Iwasawa & Matsuo 2021)	2020 onward
Domain generalisation	Train on multiple sources to be robust to unseen target	DICA (Muandet et al. 2013); IRM (Arjovsky et al. 2019); GroupDRO (Sagawa et al. 2020); MixStyle (Zhou et al. 2021)	2013 onward

Discrepancy-based feature alignment

Maximum Mean Discrepancy (MMD) is a kernel-based distance between distributions, measuring the difference between mean embeddings of two samples in a Reproducing Kernel Hilbert Space (Gretton et al. 2007). Long, Cao, Wang, and Jordan (2015) introduced Deep Adaptation Networks (DAN), adding a multi-kernel MMD penalty across multiple task-specific layers of a CNN. Their follow-up Joint Adaptation Networks (JAN, ICML 2017) align the joint distribution of multiple layers rather than marginals layer-by-layer.
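A minimal sketch of the quantity being penalised: the biased estimate of squared MMD with a single Gaussian kernel. DAN uses a mixture of kernels inside a network; the single fixed bandwidth and the function names here are simplifying assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(300, 2))
same = rng.normal(0.0, 1.0, size=(300, 2))      # same distribution as src
shifted = rng.normal(2.0, 1.0, size=(300, 2))   # mean-shifted distribution

mmd_same = mmd2_biased(src, same)        # close to zero
mmd_shifted = mmd2_biased(src, shifted)  # clearly positive
```

The estimate is sensitive to the bandwidth, which is part of the motivation for the multi-kernel variant.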

Sun and Saenko (2016) proposed Deep CORAL in an ECCV 2016 workshop paper. CORAL aligns the second-order statistics of source and target features by minimising the Frobenius distance between their covariance matrices; Deep CORAL turns this into a differentiable loss that can be trained end-to-end inside a network. The authors call CORAL a "frustratingly easy" method because it takes only a few lines of code, yet it matches more elaborate methods on several benchmarks.
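The covariance-matching loss is short enough to write out. The sketch below uses the $1 / (4 d^2)$ scaling from the Deep CORAL paper, with $d$ the feature dimension; it operates on plain NumPy arrays rather than inside a network, and the function name is hypothetical.

```python
import numpy as np

def coral_loss(source, target):
    """Squared Frobenius distance between feature covariances, scaled by 1 / (4 d^2)."""
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False)   # rows are samples, columns are features
    cov_t = np.cov(target, rowvar=False)
    return float(np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d))

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 3))

loss_identical = coral_loss(feats, feats)        # zero: same covariance
loss_translated = coral_loss(feats, feats + 5.0) # ~zero: covariance ignores the mean
loss_rescaled = coral_loss(feats, 2.0 * feats)   # positive: covariance is 4x larger
```

Because only second-order statistics are compared, a pure translation of the features leaves the loss unchanged, so CORAL is typically used alongside a classification loss rather than on its own.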

Adversarial methods

Adversarial alignment turns the $\mathcal{H}$-divergence into an explicit optimisation. Ganin and Lempitsky (2015) introduced Domain-Adversarial Neural Networks (DANN) at ICML, with a journal version by Ganin et al. (2016) in JMLR. DANN attaches a domain classifier to the feature extractor through a gradient reversal layer (GRL) that multiplies the gradient by a negative constant during backpropagation. The feature extractor is trained to maximise the domain classifier's loss, producing features that are discriminative for the source task but indistinguishable across domains. The method needs no architectural changes other than the GRL.
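The gradient reversal layer itself is tiny. The sketch below writes the forward and backward passes by hand in NumPy so that it stays self-contained; in an autograd framework the same behaviour would be expressed as a custom operation. The linear domain classifier, the random features, and the value of `lam` are hypothetical.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return features

    def backward(self, grad_output):
        return -self.lam * grad_output

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))        # output of a hypothetical feature extractor
domain = np.array([0.0, 0.0, 1.0, 1.0])   # 0 = source, 1 = target
w = rng.normal(size=3)                    # hypothetical linear domain classifier

grl = GradientReversal(lam=0.5)
logits = grl.forward(features) @ w
probs = 1.0 / (1.0 + np.exp(-logits))

# Gradient of the mean binary cross-entropy with respect to the GRL output.
grad_at_grl_output = np.outer(probs - domain, w) / len(domain)
# What the feature extractor receives: the same gradient, sign-flipped and scaled.
grad_to_extractor = grl.backward(grad_at_grl_output)
```

The domain classifier's own weights are updated with the unreversed gradient, so it keeps trying to separate the domains while the extractor is pushed in the opposite direction.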

Tzeng, Hoffman, Saenko, and Darrell (2017) generalised this idea in Adversarial Discriminative Domain Adaptation (ADDA) at CVPR. ADDA uses a two-stage scheme: a source encoder and classifier are trained first, then a separate target encoder with untied weights is trained against a domain discriminator using a standard GAN-style adversarial loss. The authors report that this is simpler than earlier adversarial methods and more effective on cross-modality tasks such as adapting from RGB to depth.

Long, Cao, Wang, and Jordan (2018) introduced Conditional Domain Adversarial Networks (CDAN) at NeurIPS, which condition the domain discriminator on the cross-covariance between feature representations and classifier predictions. CDAN aligns the joint distribution rather than just the marginal, which helps when the data are multimodal in the class sense.
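The conditioning step can be illustrated as an outer product between each feature vector and its prediction vector, flattened before it is passed to the discriminator. This is only a sketch of that input construction; the discriminator and its training are omitted, and the function name and toy values are hypothetical.

```python
import numpy as np

def multilinear_map(features, predictions):
    """Flattened outer product of each feature vector with its prediction vector."""
    batch = features.shape[0]
    return np.einsum("bi,bj->bij", features, predictions).reshape(batch, -1)

features = np.array([[1.0, 2.0, 3.0],
                     [0.5, 0.0, -1.0]])      # batch of 2, feature dimension 3
predictions = np.array([[0.9, 0.1],
                        [0.2, 0.8]])         # softmax outputs over 2 classes

joint = multilinear_map(features, predictions)   # shape (2, 6)
```

The joint input has dimension (feature dimension) x (number of classes), which grows quickly for large networks.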

Saito, Watanabe, Ushiku, and Harada (2018) proposed Maximum Classifier Discrepancy (MCD) at CVPR. MCD trains two task classifiers on top of a shared feature extractor and alternates between maximising and minimising their disagreement on target samples, using the discrepancy as a proxy for whether target features lie inside the source classifier's well-supported region.
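The disagreement measure can be sketched as the mean absolute difference between the two classifiers' probability outputs. The probabilities below are made-up values for illustration, and the function name is hypothetical.

```python
import numpy as np

def classifier_discrepancy(probs_a, probs_b):
    """Mean absolute difference between two classifiers' class probabilities."""
    return float(np.mean(np.abs(probs_a - probs_b)))

probs_a = np.array([[0.9, 0.1],
                    [0.2, 0.8]])   # classifier A on two target samples
probs_b = np.array([[0.1, 0.9],
                    [0.2, 0.8]])   # classifier B disagrees on the first sample only

disc = classifier_discrepancy(probs_a, probs_b)   # 0.4
```

In the alternating scheme, the classifiers are updated to increase this value on target samples and the feature extractor is updated to decrease it.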

Generative, optimal-transport, and self-training methods

Hoffman et al. (2018) introduced CyCADA (Cycle-Consistent Adversarial Domain Adaptation) at ICML, which adapts at both the pixel level and the feature level, building on CycleGAN to translate source images into target style with a semantic consistency loss. The approach was particularly effective on synthetic-to-real urban driving (GTA to Cityscapes).

Courty, Flamary, Tuia, and Rakotomamonjy introduced OTDA in TPAMI 2016, transporting source samples to the target via an entropy-regularised Wasserstein problem. Courty, Flamary, Habrard, and Rakotomamonjy followed with JDOT (Joint Distribution Optimal Transport) at NeurIPS 2017, which aligns joint feature-label distributions and learns the prediction function and transport plan together. DeepJDOT (Damodaran et al. 2018) extended this to deep networks.
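A minimal sketch of the entropy-regularised transport step, using plain Sinkhorn iterations and a barycentric mapping of source points onto the target. The regularisation strength, the cost normalisation, the iteration count, and the function names are assumptions of this example; dedicated optimal-transport libraries provide more robust solvers.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.5, iters=500):
    """Entropy-regularised OT plan between histograms a and b via Sinkhorn iterations."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (kernel.T @ u)   # rescale to match the target marginal
        u = a / (kernel @ v)     # rescale to match the source marginal
    return u[:, None] * kernel * v[None, :]

def barycentric_map(plan, target):
    """Move each source point to the plan-weighted average of the target points."""
    return (plan @ target) / plan.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(60, 2))   # source samples
xt = rng.normal(3.0, 1.0, size=(80, 2))   # target samples

cost = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(axis=-1)
cost /= cost.max()                        # normalise so that reg has a stable meaning
a = np.full(len(xs), 1.0 / len(xs))       # uniform source weights
b = np.full(len(xt), 1.0 / len(xt))       # uniform target weights

plan = sinkhorn_plan(cost, a, b)
moved = barycentric_map(plan, xt)         # transported source samples
```

A classifier trained on `moved` with the original source labels then operates in the target's region of feature space. Larger `reg` values make the plan smoother and the iterations faster, at the cost of blurring the mapping.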

Self-training adapts a model by using its confident predictions on target data as pseudo-labels for further training. Liang, Hu, and Feng (2020) introduced SHOT (Source HypOthesis Transfer) at ICML, which freezes the source classifier and learns a new feature extractor for the target by combining information maximisation with self-supervised pseudo-labelling. SHOT does not need source data at adaptation time, which matters when source data is private or proprietary.
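The information-maximisation part of the objective can be sketched as the mean per-sample prediction entropy minus the entropy of the batch-averaged prediction: the first term rewards confident predictions, the second discourages collapsing onto one class. The sketch below covers only this loss, not the pseudo-labelling step, and the function names and toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def information_maximisation_loss(logits, eps=1e-8):
    """Mean per-sample entropy minus the entropy of the batch-mean prediction."""
    p = softmax(logits)
    sample_entropy = -np.sum(p * np.log(p + eps), axis=1).mean()
    mean_p = p.mean(axis=0)
    marginal_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    return float(sample_entropy - marginal_entropy)

balanced = 10.0 * np.tile(np.eye(3), (4, 1))        # confident, evenly spread over 3 classes
collapsed = np.tile([[10.0, 0.0, 0.0]], (12, 1))    # confident, but always class 0

loss_balanced = information_maximisation_loss(balanced)    # near -log(3)
loss_collapsed = information_maximisation_loss(collapsed)  # near 0
```

Confident and diverse predictions give the lowest loss, which is the behaviour the target feature extractor is trained towards.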

Test-time adaptation

Wang, Shelhamer, Liu, Olshausen, and Darrell (2021) introduced TENT (Test-time Entropy Minimisation) at ICLR 2021. TENT updates only the channel-wise affine parameters of batch-norm layers at test time, by minimising the entropy of the model's predictions on the test batch. It needs only the trained model and the unlabelled test data; no source data is required. TENT reduced error substantially on ImageNet-C, the corruption benchmark, and worked on segmentation and digit-recognition adaptation. Subsequent work in this line includes MEMO (Zhang, Levine & Finn 2022) and T3A (Iwasawa & Matsuo 2021), which adjusts only the classifier's prototypes at test time.
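A toy version of the update is sketched below for a single normalisation layer followed by a frozen linear classifier, with the entropy gradient derived by hand. The real method operates on all batch-norm layers of a deep network through automatic differentiation; the model, learning rate, and names here are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tent_step(x, gamma, beta, weights, lr=0.05, eps=1e-5):
    """One entropy-minimisation step on the affine parameters (gamma, beta).

    The classifier `weights` stay frozen, and the normalisation statistics come
    from the test batch itself. Returns the entropy measured before the update.
    """
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    p = softmax((gamma * x_hat + beta) @ weights)
    log_p = np.log(p + 1e-12)
    entropy = -np.sum(p * log_p, axis=1, keepdims=True)
    grad_logits = -p * (log_p + entropy) / len(x)   # d(mean entropy) / d(logits)
    grad_hidden = grad_logits @ weights.T
    gamma = gamma - lr * np.sum(grad_hidden * x_hat, axis=0)
    beta = beta - lr * np.sum(grad_hidden, axis=0)
    return gamma, beta, float(entropy.mean())

rng = np.random.default_rng(0)
x_test = rng.normal(1.0, 2.0, size=(64, 5))   # unlabelled, shifted test batch
weights = rng.normal(size=(5, 3))             # frozen classifier (hypothetical)
gamma, beta = np.ones(5), np.zeros(5)

entropies = []
for _ in range(100):
    gamma, beta, h = tent_step(x_test, gamma, beta, weights)
    entropies.append(h)
```

The recorded entropy falls over the steps, but lower entropy is not the same as higher accuracy: without labels the update can also make wrong predictions more confident.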

Domain generalisation methods

Domain generalisation removes target access entirely. Muandet, Balduzzi, and Schölkopf (2013) introduced Domain-Invariant Component Analysis (DICA), a kernel method that learns a transformation minimising cross-domain dissimilarity. Modern DG methods include Invariant Risk Minimisation (IRM, Arjovsky et al. 2019), Group Distributionally Robust Optimisation (GroupDRO, Sagawa et al. 2020), and style-mixing methods such as MixStyle (Zhou et al. 2021).

A notable empirical finding came from Gulrajani and Lopez-Paz (2021) at ICLR. Their DomainBed benchmark unified 7 datasets, 14 algorithms, and 3 model selection criteria. They reported that, when tuned with the same hyperparameter budget, plain Empirical Risk Minimisation (ERM) was competitive with all proposed DG methods, and no method beat ERM by more than one point on average. This has reshaped how the community evaluates DG.

Standard benchmarks

Progress in the field is measured on a small set of canonical datasets.

Benchmark	Year	Domains	Classes	Notes
Office-31	2010	3 (Amazon, Webcam, DSLR)	31	Saenko, Kulis, Fritz & Darrell at ECCV; 4,652 office object images; standard small-scale UDA
Office-Caltech-10	2012	4 (Amazon, Webcam, DSLR, Caltech)	10	Gong et al. extension of Office-31
Office-Home	2017	4 (Art, Clipart, Product, Real-World)	65	Venkateswara et al.; ~15,500 images; medium-scale
VisDA-2017	2017	2 (synthetic, real)	12	Peng et al. challenge dataset; large synthetic-to-real
DomainNet	2019	6 (Clipart, Infograph, Painting, Quickdraw, Real, Sketch)	345	Peng et al. at ICCV; ~600,000 images; largest visual DA benchmark
PACS	2017	4 (Photo, Art painting, Cartoon, Sketch)	7	Li, Yang, Song & Hospedales at ICCV; 9,991 images; main DG benchmark
DomainBed	2021	varies (PACS, OfficeHome, VLCS, TerraIncognita, etc.)	varies	Gulrajani & Lopez-Paz unified DG testbed
WILDS	2021	10 datasets in real domains	varies	Koh et al. at ICML; tumour images, wildlife camera traps, satellite imagery, code, text

WILDS aims explicitly at real-world shifts. Each of its constituent datasets contains a natural domain split: different hospitals (Camelyon17 tumour identification), different camera traps (iWildCam), different times and locations (FMoW satellite imagery), or different countries (PovertyMap). The authors report a consistent gap between in-distribution and out-of-distribution accuracy on every dataset that persists even with state-of-the-art DA methods.

Applications

Domain adaptation shows up wherever deployment differs from training.

Medical imaging. Models trained at one hospital systematically underperform at others because of scanner brand, acquisition protocol, demographics, and disease prevalence. Adaptation methods are routinely benchmarked on tumour classification (Camelyon17), chest X-rays, and retinal imaging.
Autonomous driving and sim-to-real. Driving policies and perception modules are typically trained in simulation (CARLA, GTA, AirSim) and adapted to real video. Tobin, Fong, Ray, Schneider, Zaremba, and Abbeel (2017) introduced domain randomisation at IROS, which randomises non-essential simulator properties (lighting, textures, distractors) so that the real world looks like just another simulated variant. It is now one of the dominant sim-to-real techniques in robotics.
Robotics manipulation. Grasp models trained in simulation are adapted to physical robots through randomisation, image translation, or progressive networks.
Cross-domain NLP. Sentiment analysis from movie reviews to product reviews; biomedical NER; cross-lingual transfer for low-resource languages.
Speech recognition. Cross-accent, cross-microphone, and cross-language ASR adaptation. Modern ASR combines self-supervised pretraining (wav2vec 2.0, HuBERT) with task-specific fine-tuning.
Recommender systems. Cross-domain recommendation, where interactions in a source domain inform predictions in a target domain for the same users.
Industrial fault diagnosis and remote sensing. Adaptation across machines, satellite sensors, geographic regions, and time of acquisition.

Domain adaptation also plays a role in fairness work: subpopulation shift formalises the accuracy gap between majority and minority groups, and methods such as GroupDRO can be cast as DA between subpopulations.

Connection to broader concepts

Transfer learning is the umbrella concept; DA is the case where source and target tasks are the same but input distributions differ. Pan and Yang (2010) is the canonical survey.
Multi-task learning trains on multiple tasks at once, which can implicitly help DA.
Continual / lifelong learning addresses sequential domain shifts, where the model must adapt without forgetting.
Self-supervised pre-training. Models such as CLIP, DINO, and DINOv2 produce representations that often transfer well to new domains. Empirical work since 2021 has shown that these foundation models often beat purpose-built DA methods on standard benchmarks because their source distribution is broad.
Foundation models and prompt-based adaptation. Few-shot prompting and parameter-efficient adaptation (LoRA, prefix tuning, adapters) are forms of DA that exploit a strong pre-trained representation.
Robustness to distribution shift. A nearby literature studies natural and adversarial robustness, including the Recht et al. study and work on adversarial attacks and corruption robustness (ImageNet-C).

Practical considerations

Beyond the algorithm zoo, there are several recurring practical issues that often matter more than the choice of method.

Hyperparameter selection without target labels. With no labelled target data, standard cross-validation cannot be used. Workarounds include source validation, importance-weighted validation, and a small held-out target set, but Gulrajani and Lopez-Paz showed that this choice can swing reported accuracies by several points.
Unfair comparisons. Papers historically used different backbones, augmentation, and training budgets, so apparent improvements on Office-31 sometimes reflected those rather than the proposed method. DomainBed and WILDS were designed to fix this.
Negative transfer. When source and target are too different, DA can hurt rather than help. The $\lambda$ term in the Ben-David bound formalises this: if no single hypothesis achieves low error on both domains, aligning the feature distributions cannot guarantee low target error.
Data efficiency vs collecting target labels. Few-shot adaptation typically beats elaborate UDA once a few hundred target labels are available, especially with a strong pre-trained backbone.
Foundation-model fine-tuning often beats classical DA. A 2023-2024 line of work has shown that supervised fine-tuning of a CLIP backbone on the source matches or exceeds the best UDA methods on Office-Home and DomainNet without any target data at all.
Batch-norm statistics matter. Many gains in test-time adaptation come from re-estimating batch-norm statistics on target data alone, which is essentially free.
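The batch-norm point can be illustrated with a few lines of NumPy. The sketch assumes Gaussian toy features and a single normalisation layer; the function and variable names are hypothetical.

```python
import numpy as np

def batch_norm(x, mean, var, eps=1e-5):
    """Normalise features with the given per-feature statistics."""
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(1000, 4))
target_feats = rng.normal(2.0, 3.0, size=(1000, 4))   # shifted and rescaled

src_mean, src_var = source_feats.mean(axis=0), source_feats.var(axis=0)
tgt_mean, tgt_var = target_feats.mean(axis=0), target_feats.var(axis=0)

stale = batch_norm(target_feats, src_mean, src_var)       # stored source statistics
refreshed = batch_norm(target_feats, tgt_mean, tgt_var)   # re-estimated on the target
```

With the stored source statistics the target activations reach the next layer off-centre and with inflated scale; re-estimating the statistics on unlabelled target data restores zero mean and unit variance without touching any learned weights.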

Open challenges

Several problems are active research areas as of 2026.

Theoretical guarantees with deep networks. The classical bounds assume a fixed hypothesis class; deep models effectively learn the hypothesis class, which complicates the divergence terms.
Source-free and federated DA. As privacy and data-residency rules tighten, shipping source data to deployment sites is often impossible. Source-free methods such as SHOT and federated DA address this.
Test-time adaptation in non-stationary streams. TENT and successors assume i.i.d. test batches; real deployment streams may include adversarial or out-of-distribution inputs.
Multi-modal DA. Adapting models that combine vision, language, and other modalities introduces new shifts, since each modality may move independently.
Foundation-model robustness. Quantifying when and why a CLIP-style backbone generalises across domains, and where it still fails (medical imaging, low-resource languages, specialised industrial sensors).
Fairness during adaptation. Adaptation methods can preserve or exacerbate disparities across subpopulations.
Continual / lifelong DA. Making models that adapt to new domains without forgetting old ones and without access to old source data.

The classical UDA setting with Office-31 and CNNs trained from scratch is now mostly a teaching example. The current frontier is about source-free and test-time methods, benchmarks like WILDS that resemble real deployment, and how foundation models change the meaningful baselines.
