See also: Machine learning terms, Probability, Statistics, Distribution shift, Out-of-distribution detection
Introduction
In probability theory and statistics, a collection of random variables is said to be independently and identically distributed (abbreviated i.i.d., iid, or IID) if each variable has the same probability distribution as the others and all variables are mutually independent. This concept is one of the most important foundational assumptions across machine learning, statistical inference, and data science. Nearly every classical algorithm in supervised and unsupervised learning relies, either explicitly or implicitly, on the assumption that the data points in a dataset are i.i.d. samples drawn from some underlying distribution.
The i.i.d. assumption simplifies the mathematics of learning and inference considerably. It allows the joint probability of an entire dataset to be expressed as the product of individual probabilities, which makes optimization tractable and enables powerful theoretical guarantees about generalization. However, many real-world datasets violate this assumption, and understanding when and how i.i.d. breaks down is essential for building reliable models. Modern work on distribution shift, out-of-distribution (OOD) detection, domain adaptation, and federated learning begins from the observation that the i.i.d. picture is too clean for the messy data flows that real systems encounter in deployment.
Overview
The phrase "i.i.d." packs together two ideas that are usually taught separately. Independence is about the absence of any informational link between observations: knowing one tells you nothing about another. Identical distribution is about the absence of any drift or selection effect: every observation is produced by the same underlying mechanism. The combination produces the cleanest mathematical setting imaginable. Joint probabilities factor, sums of random variables behave well under limit theorems, estimators are consistent under mild moment conditions, and empirical averages converge to expectations. Most introductory probability and statistics courses assume i.i.d. throughout, often without saying so explicitly.
This article describes what i.i.d. means formally, traces its historical roots from Jacob Bernoulli to Andrey Kolmogorov, walks through why the assumption is so heavily used in classical statistics and modern machine learning, and then surveys the large body of work that has emerged specifically because real data is rarely i.i.d. Topics covered include the central limit theorem, the law of large numbers, PAC learning, the Vapnik-Chervonenkis (VC) dimension, Rademacher complexity, exchangeability and de Finetti's theorem, the formal taxonomy of dataset shift, importance weighting under covariate shift, label-shift correction, concept drift, OOD detection (softmax baseline, ODIN, energy scores, Deep SVDD), OOD generalization benchmarks (DomainBed, WILDS), conformal prediction, federated learning over non-i.i.d. clients, causal inference, and practical strategies for splitting non-i.i.d. data.
The term "independently and identically distributed" combines two distinct mathematical properties.
Independence
A set of random variables X₁, X₂, ..., Xₙ is independent if the realization of any one variable provides no information about the others. Formally, two random variables X and Y are independent if their joint cumulative distribution function (CDF) equals the product of their individual CDFs:
F_{X,Y}(x, y) = F_X(x) * F_Y(y) for all x, y
Equivalently, in terms of probabilities:
P(X in A, Y in B) = P(X in A) * P(Y in B) for all events A and B
For a full collection of n variables, this factorization must hold for every possible subset, not just for pairs. A weaker condition called pairwise independence only requires the factorization to hold for pairs, and pairwise independence does not imply full mutual independence. The standard i.i.d. assumption used in statistics and machine learning is the stronger mutual independence.
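A standard counterexample, not taken from the text above, makes the gap between pairwise and mutual independence concrete: flip two fair coins X and Y and set Z = X XOR Y. The sketch below enumerates the four equally likely outcomes exactly; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes of two fair coins, with Z = X XOR Y.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def prob(event):
    """Exact probability of an event under the uniform distribution on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def pairwise_independent():
    """Check P(A=a, B=b) = P(A=a) * P(B=b) for every pair of coordinates."""
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        for a, b in product([0, 1], repeat=2):
            joint = prob(lambda o: o[i] == a and o[j] == b)
            if joint != prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b):
                return False
    return True

def mutually_independent():
    """Check the three-way factorization in addition to the pairwise one."""
    for a, b, c in product([0, 1], repeat=3):
        joint = prob(lambda o: o == (a, b, c))
        marginals = (prob(lambda o: o[0] == a)
                     * prob(lambda o: o[1] == b)
                     * prob(lambda o: o[2] == c))
        if joint != marginals:
            return False
    return pairwise_independent()

print(pairwise_independent())   # True
print(mutually_independent())   # False
```

Every pair of the three variables factorizes, yet any two of them determine the third, so the three-way factorization fails.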
Identically distributed
Random variables are identically distributed if they all share the same probability distribution. Formally, X₁ and X₂ are identically distributed if:
F_{X₁}(x) = F_{X₂}(x) for all x
This means that the mechanism generating each data point is the same. There are no trends, shifts, or changes in the distribution over time or across samples. Random variables that are independent but not identically distributed are sometimes written as i.n.i.d. and arise often in survey sampling and meta-analysis. Random variables that share a distribution but are dependent are common in time series and spatial statistics.
Combining both properties
A sequence of random variables is i.i.d. if and only if both conditions hold simultaneously. Each observation must be drawn from the same distribution, and the draw of one observation must have no effect on any other. When both conditions are satisfied, the joint probability of the entire dataset decomposes into a simple product:
P(X₁, X₂, ..., Xₙ) = P(X₁) * P(X₂) * ... * P(Xₙ)
This factorization is the key property that makes statistical analysis and maximum likelihood estimation computationally feasible. Without it, the distribution of each observation would, in principle, have to be modeled conditionally on every other observation.
Notation
The shorthand X₁, ..., Xₙ ~ F i.i.d. is read as "X-one through X-n are i.i.d. samples from the distribution F." The notation X₁, ..., Xₙ ~ p i.i.d. uses a density or mass function p in place of F. In machine learning, the standard supervised learning setup writes the training set as {(X₁, Y₁), ..., (Xₙ, Yₙ)} ~ P i.i.d., where P is the unknown joint distribution over inputs and labels.
Historical context
The formal concept of i.i.d. emerged gradually over three centuries of work on probability and statistics.
Bernoulli and the first law of large numbers
The earliest precursor of the i.i.d. assumption appears in Jacob Bernoulli's Ars Conjectandi, written between 1684 and 1690 and published posthumously in 1713. Bernoulli proved what he called his "golden theorem": that the proportion of successes in a long sequence of independent trials with the same success probability converges to that probability. This is the first version of the law of large numbers and the first nontrivial limit theorem in probability. Bernoulli's trials are now a textbook example of i.i.d. random variables.
Laplace and the central limit theorem
Pierre-Simon Laplace generalized Bernoulli's result in his 1812 Théorie analytique des probabilités and proved an early version of the central limit theorem for sums of i.i.d. random variables. Throughout the 19th century, mathematicians including Pafnuty Chebyshev, Andrey Markov, and Aleksandr Lyapunov developed sharper versions of the limit theorems and extended them to non-identically distributed sequences.
Kolmogorov and the modern axiomatic foundation
The modern formulation of probability theory came from Andrey Kolmogorov's Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability), published in 1933. Kolmogorov axiomatized probability using measure theory, giving precise meaning to independence, identical distribution, and joint distributions of infinite sequences. The Kolmogorov extension theorem makes it possible to talk rigorously about infinite sequences of i.i.d. random variables, which is the foundation for asymptotic statistics.
De Finetti and exchangeability
In 1937, Bruno de Finetti published "La prévision : ses lois logiques, ses sources subjectives," based on lectures given at the Institut Henri Poincaré in Paris in 1935. De Finetti introduced the weaker notion of exchangeability and proved that any infinite exchangeable sequence is a mixture of i.i.d. sequences. This result tied subjective Bayesian probability to the i.i.d. framework and remains foundational for Bayesian statistics and modern conformal prediction.
Statistical learning theory
In the 1970s, Vladimir Vapnik and Alexey Chervonenkis developed the uniform convergence theory that became the basis for statistical learning theory. Their results assume that training samples are drawn i.i.d. from an unknown distribution. Leslie Valiant's 1984 paper "A Theory of the Learnable" introduced the Probably Approximately Correct (PAC) framework, which also assumes i.i.d. samples. Together, these results gave machine learning a rigorous theoretical justification for inferring from finite training data.
Why i.i.d. matters
The i.i.d. assumption is not just a convenience. It is the load-bearing wall behind almost every formal result that justifies statistical inference and machine learning.
Law of large numbers
The law of large numbers (LLN) states that if X₁, X₂, ..., Xₙ are i.i.d. random variables with finite mean mu, then the sample mean converges to mu as n grows large. The weak law asserts convergence in probability; the strong law asserts almost-sure convergence. Without i.i.d., the sample mean can still converge under additional structure such as stationarity and ergodicity, but the simple proofs and tight rates require independence and identical distribution. The LLN is the formal justification for using sample averages as estimates of population means and underpins the entire framework of statistical estimation.
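A small simulation can make the convergence visible. The sketch below is illustrative only: the Exponential distribution with mean 2, the seed, and the sample sizes are arbitrary choices, not taken from any source.

```python
import random

def sample_mean(n, rng):
    """Mean of n i.i.d. Exponential draws with true mean 2.0 (rate 0.5)."""
    return sum(rng.expovariate(0.5) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Absolute error of the sample mean for increasing sample sizes.
errors = {n: abs(sample_mean(n, rng) - 2.0) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>7}  |sample mean - mu| = {err:.4f}")
```

The error for any single run is random, but its typical size shrinks roughly like 1 / sqrt(n), which is why the largest sample lands close to the true mean.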
Central limit theorem
The central limit theorem (CLT) states that if X₁, X₂, ..., Xₙ are i.i.d. random variables with finite mean mu and finite variance sigma-squared, then the standardized sample mean converges in distribution to a standard normal distribution as n approaches infinity:
(X-bar - mu) / (sigma / sqrt(n)) converges to N(0, 1)
The CLT is the reason that many statistical methods assume approximate normality even when the underlying data is not normal. It explains why confidence intervals built around a sample mean use the t or z distribution. The classical version stated above, the Lindeberg-Lévy CLT, requires full i.i.d. Extensions relax or refine it: the Lindeberg-Feller and Lyapunov CLTs allow variables that are independent but not identically distributed, and the Berry-Esseen theorem quantifies the rate of convergence.
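The normal approximation can be checked empirically. The following sketch assumes Uniform(0, 1) draws with n = 30 per sample (both arbitrary choices) and measures how often the standardized sample mean falls within 1.96 of zero, which a standard normal does about 95 percent of the time.

```python
import math
import random

def standardized_mean(n, rng):
    """(X-bar - mu) / (sigma / sqrt(n)) for n i.i.d. Uniform(0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    xbar = sum(rng.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

rng = random.Random(0)
z_values = [standardized_mean(30, rng) for _ in range(5_000)]

# Fraction of standardized means inside the usual 95% normal interval.
coverage = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)
print(f"fraction with |z| <= 1.96: {coverage:.3f}  (N(0, 1) predicts about 0.95)")
```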
Maximum likelihood factorization
Under i.i.d., the joint likelihood of a dataset factors into a product of marginal likelihoods, and the log-likelihood becomes a sum. This factorization is what makes maximum likelihood estimation computationally tractable. Modern deep learning still relies on this factorization: minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of i.i.d. training examples.
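As a minimal sketch of this factorization, consider i.i.d. Bernoulli data: the log-likelihood is a plain sum of per-example terms, and its maximizer coincides with the sample mean. The data values and the grid search are illustrative choices; a real implementation would use the closed form or a gradient-based optimizer.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli(theta) data: a sum of per-example terms."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # illustrative sample: 7 ones out of 10

# Crude grid search over theta in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_hat)              # 0.7
print(sum(data) / len(data))  # 0.7
```

Dividing the negated log-likelihood by the number of examples gives the average binary cross-entropy, which is the sense in which cross-entropy training is maximum likelihood under i.i.d. sampling.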
Empirical risk approximates expected risk
Statistical learning theory is built on the idea that empirical risk on a training set is a good proxy for expected risk on new data. If training samples are i.i.d. from a distribution P, then the empirical risk R-hat(h) is an unbiased estimator of the true risk R(h) = E[loss(h(X), Y)], and the strong law of large numbers gives almost-sure convergence as the sample size grows. Uniform convergence bounds (in terms of VC dimension or Rademacher complexity) extend this from a single hypothesis to a hypothesis class, again under i.i.d. sampling. Without i.i.d., the empirical risk can be a badly biased estimator of expected risk.
Sufficient statistics and exponential families
In an exponential family, the sufficient statistic for a sample of n i.i.d. observations is a sum (or average) of a fixed function of each observation, so its dimension does not grow with n. The Fisher information for a sample of n i.i.d. observations equals n times the Fisher information for a single observation; this additivity gives the Cramér-Rao lower bound its 1/n scaling and underlies the asymptotic efficiency of maximum likelihood estimators.
i.i.d. in machine learning
The i.i.d. assumption is deeply embedded in the foundations of machine learning. Most standard algorithms and evaluation procedures assume that data points in the training set and test set are i.i.d. samples from the same underlying distribution. This assumption matters for several interconnected reasons.
Mathematical tractability
When data is i.i.d., the likelihood function for the entire dataset factors into a product of individual likelihoods. This makes it possible to use gradient descent and other optimization methods to find model parameters that maximize the likelihood. Without this factorization, computing the joint probability of all observations would require modeling complex dependencies between every pair of data points.
Generalization guarantees
Statistical learning theory provides bounds on how well a model trained on a finite sample will perform on unseen data. These generalization bounds, including the Probably Approximately Correct (PAC) learning framework introduced by Leslie Valiant in 1984 and Vapnik-Chervonenkis (VC) theory developed in the 1970s, assume that training and test data are drawn i.i.d. from the same distribution. If this assumption holds, a model that performs well on the training data is likely to perform well on new data, provided the model is not too complex (avoiding overfitting).
A central result in this area is the equivalence between finite VC dimension and PAC learnability in the realizable setting: a hypothesis class is PAC learnable if and only if its VC dimension is finite, and the sample complexity scales linearly with the VC dimension. Vapnik showed that, with high probability over the i.i.d. draw of the training set, the test error on data from the same distribution is bounded by the training error plus a term involving the VC dimension and the sample size. Rademacher complexity, introduced for risk bounds by Peter Bartlett and Shahar Mendelson in 2002, provides data-dependent generalization bounds that are tighter than VC bounds for many practical hypothesis classes; these bounds also assume i.i.d. samples.
Model evaluation
Standard evaluation techniques such as cross-validation, holdout validation, and bootstrapping all assume i.i.d. data. In k-fold cross-validation, for example, the data is randomly partitioned into k subsets. The validity of this procedure depends on each data point being interchangeable, meaning no data point carries information about another. When the data is not i.i.d., random splitting can produce overly optimistic performance estimates because information can leak between folds.
Stochastic gradient descent
Stochastic gradient descent (SGD) is the workhorse optimizer for deep learning. Its convergence analysis assumes that each mini-batch gradient is an unbiased estimate of the gradient of the full training objective, which holds when the mini-batches are sampled i.i.d. from the training set. Many practical training pipelines reshuffle the training set at the start of every epoch precisely to keep mini-batches close to i.i.d.
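The reshuffling pattern can be sketched on a toy problem. Everything below is an assumption made for illustration: the one-dimensional regression data, the learning rate, the batch size, and the number of epochs.

```python
import random

rng = random.Random(0)

# Synthetic 1-D regression data: y = 3x + small noise (illustrative values).
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

def minibatch_gradient(w, batch):
    """Gradient of the mean squared error (w*x - y)^2 over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w, lr, batch_size = 0.0, 0.1, 10
for epoch in range(50):
    # Reshuffle every epoch so batches are not tied to the storage order.
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        w -= lr * minibatch_gradient(w, data[start:start + batch_size])

print(f"estimated slope: {w:.3f}")
```

If the data were instead stored sorted by x and never shuffled, each mini-batch would cover only a narrow slice of the input range, and its gradient would no longer be a representative estimate.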
Algorithms that assume i.i.d. data
The following table summarizes common machine learning algorithms and how they rely on the i.i.d. assumption.
Exchangeability
Exchangeability is a weaker condition than i.i.d. that plays an important role in Bayesian statistics and in conformal prediction.
A sequence of random variables Z₁, Z₂, ..., Zₙ is exchangeable if the joint distribution is invariant under any permutation of the indices: for any permutation pi of {1, 2, ..., n}, the joint distribution of (Z₁, ..., Zₙ) equals the joint distribution of (Z_pi(1), ..., Z_pi(n)). The order of observations does not matter. Every i.i.d. sequence is exchangeable, but the converse is not true. For example, draws from a Polya urn model are exchangeable but not independent: each draw changes the composition of the urn, so future draws depend on past draws.
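The Pólya urn claim can be verified by exact arithmetic. The sketch below assumes the simplest variant (one red and one black ball to start, and each drawn ball returned together with one extra ball of the same color); the function name is illustrative.

```python
from fractions import Fraction

def sequence_prob(seq, red=1, black=1):
    """Exact probability of a color sequence ('R'/'B') from a Polya urn.

    After each draw the ball is returned along with one extra ball
    of the same color.
    """
    p = Fraction(1)
    for color in seq:
        total = red + black
        if color == "R":
            p *= Fraction(red, total)
            red += 1
        else:
            p *= Fraction(black, total)
            black += 1
    return p

# Exchangeable: reordering a sequence does not change its probability.
print(sequence_prob("RRB"), sequence_prob("RBR"), sequence_prob("BRR"))  # 1/12 1/12 1/12

# Not independent: the first draw changes the odds of the second.
p_second_red = sequence_prob("RR") + sequence_prob("BR")
p_second_red_given_first_red = sequence_prob("RR") / sequence_prob("R")
print(p_second_red, p_second_red_given_first_red)  # 1/2 2/3
```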
De Finetti's theorem
De Finetti's theorem (1937) provides a deep connection between exchangeability and i.i.d. The theorem states that any infinite sequence of exchangeable random variables can be represented as a mixture of i.i.d. sequences. More precisely, exchangeable variables are conditionally i.i.d. given some latent parameter. This result is foundational for Bayesian inference, where the unknown parameter theta is treated as a random variable with a prior distribution, and the data are modeled as i.i.d. given theta. The practical implication is that exchangeability is often a more realistic assumption than strict i.i.d. in Bayesian modeling. It allows for dependence between observations while still enabling tractable inference through the mixture representation.
Conformal prediction, introduced by Vladimir Vovk, Alex Gammerman, and Glenn Shafer in the late 1990s and elaborated in their 2005 book Algorithmic Learning in a Random World, builds finite-sample valid prediction intervals around any black-box predictor. The key assumption is exchangeability of the calibration and test points, not full i.i.d. Conformal prediction guarantees that the probability of error does not exceed the nominal level alpha, for any alpha and any conformal predictor, as long as the exchangeability assumption holds. Extensions including weighted conformal prediction (Tibshirani and colleagues, 2019) and "conformal prediction beyond exchangeability" (Barber and colleagues, 2023) handle covariate shift and bounded distribution drift.
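The split (inductive) variant of conformal prediction is short enough to sketch. The code below is a hedged illustration, not the authors' reference implementation: the data-generating process, the deliberately imperfect `predictor`, and the sample sizes are all assumptions, and the score is the absolute residual.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def predictor(x):
    """Stand-in black-box model; deliberately imperfect (true slope is 2.0)."""
    return 1.5 * x

def draw(rng):
    """One (x, y) pair from the assumed data-generating process."""
    x = rng.uniform(0.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 1.0)

rng = random.Random(0)
alpha = 0.1

# Calibration: absolute residuals of the black-box predictor on held-out data.
calibration = [draw(rng) for _ in range(1_000)]
scores = [abs(y - predictor(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha)

# Prediction interval for a new x is [predictor(x) - q, predictor(x) + q].
test_points = [draw(rng) for _ in range(5_000)]
coverage = sum(abs(y - predictor(x)) <= q for x, y in test_points) / len(test_points)
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

The coverage guarantee holds even though the predictor is biased; a worse predictor only makes the intervals wider.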
Bayesian hierarchical models
In a Bayesian hierarchical model, parameters at lower levels are exchangeable given the higher-level hyperparameters. Random-effects models in mixed-effects regression are an applied form of this idea, often used for clustered data such as patients within hospitals or students within schools.
When i.i.d. breaks
Many real-world datasets do not satisfy the i.i.d. assumption. Recognizing these violations is critical for selecting appropriate modeling strategies.
Time series data
Time series data violates independence because observations at adjacent time steps are typically correlated (autocorrelation). Stock prices, weather measurements, sensor readings, and web traffic all exhibit temporal dependencies. The standard ARIMA, GARCH, and state-space models are all designed for non-i.i.d. sequences with explicit temporal structure. Using standard cross-validation on time series data can produce misleadingly high accuracy because future information leaks into the training set. Best practice is to use a temporal split, sometimes called forward chaining, or rolling-window evaluation.
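A forward-chaining split can be sketched in a few lines. The function below is a hypothetical helper written for illustration (libraries such as scikit-learn ship their own time-series splitters); it yields expanding training windows in which every training index precedes every test index.

```python
def forward_chaining_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs in temporal order.

    Each fold trains on everything before its test window, so no future
    observation can leak into training.
    """
    for fold in range(n_folds):
        test_start = n_samples - (n_folds - fold) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

splits = list(forward_chaining_splits(n_samples=10, n_folds=3, test_size=2))
for train, test in splits:
    print(train, test)
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```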
Spatial data
Geographic and spatial data often exhibits spatial autocorrelation, where nearby locations have more similar values than distant ones. The phenomenon was famously stated by Waldo Tobler as the first law of geography: "everything is related to everything else, but near things are more related than distant things." Air quality measurements, housing prices, soil characteristics, and disease incidence at nearby locations tend to be correlated. Treating spatially correlated observations as independent can lead to underestimation of standard errors and inflated type I error rates. Specialized models include Gaussian processes with spatial kernels, kriging, and conditional autoregressive (CAR) models.
Network and graph data
When data points are connected through a social network, citation graph, or biological network, observations are not independent. A person's behavior on a social network is influenced by their connections. Standard i.i.d. methods applied to network data can yield misleading results because they ignore the relational structure. Graph neural networks (GNNs) explicitly model these dependencies. Statistical methods such as exponential random graph models (ERGMs) and stochastic block models also drop the i.i.d. assumption.
Hierarchical and clustered data
In medical studies, patients within the same hospital share latent characteristics. Students within the same school share teachers, curriculum, and peer effects. Observations within a group are more similar to each other than to observations from other groups, violating independence. Mixed-effects models, generalized estimating equations (GEEs), and multilevel models handle these structures explicitly. The intraclass correlation coefficient (ICC) quantifies how much variance comes from between-group versus within-group differences.
Streaming and non-stationary data
Production machine learning systems often process data that arrives sequentially and whose distribution drifts over time. Recommendation systems see new users and trending content; ad systems see new campaigns; fraud detection systems see new attack patterns. The standard i.i.d. picture (sample once from a fixed distribution) gives way to non-stationary streaming data, where neither identical distribution nor independence holds in the strict sense.
Active and online learning
Active learning algorithms select training examples that are most informative to label, deliberately breaking i.i.d. sampling: the selected examples are not a random sample from the input distribution. Stream-based active learning and uncertainty sampling can lead to biased datasets, which then requires importance-weighted estimators or other corrections. Online learning algorithms process data sequentially and update the model after each observation, often in settings where the data distribution may change.
Distribution shift
Even when observations within a dataset are independent, they may not be identically distributed in the same sense as a test set or future deployment data. Distribution shift is the umbrella term for any difference between the training distribution and the test (or deployment) distribution. The next section breaks distribution shift into formal subtypes.
Distribution shift taxonomy
Distribution shift has been studied under several names: dataset shift, sample selection bias, domain shift, concept drift, and others. The 2009 MIT Press book Dataset Shift in Machine Learning, edited by Quiñonero-Candela, Sugiyama, Schwaighofer, and Lawrence, is the canonical reference. A 2012 unifying paper by Jose Garcia Moreno-Torres and colleagues, "A unifying view on dataset shift in classification" (Pattern Recognition 45, pages 521 to 530), proposed the now-standard terminology that distinguishes covariate shift, prior probability shift, and concept shift based on which factor of the joint distribution P(X, Y) changes.
The following table summarizes the main types of distribution shift.
| Type | What changes | What stays the same | Typical example |
|---|---|---|---|
| Covariate shift | P(X) | P(Y\|X) | Training on photos taken in good lighting, testing on photos taken at night |
| Label shift (prior probability shift) | P(Y) | P(X\|Y) | Disease prevalence changes between training and deployment |
| Concept shift / concept drift | P(Y\|X) | P(X) | Spam keywords evolve while word frequencies stay similar |
| Domain shift | Source and target domains differ in unknown ways | Often P(Y) | Training on synthetic images, testing on real photos |
| Sample selection bias | The mechanism that selects training data depends on Y | True joint P(X, Y) | Convenience sampling that overrepresents one class |
| Subpopulation shift | Mixture weights over subgroups change | Within-subgroup distributions | Hospital A trains on adults, hospital B uses the model on children |
Covariate shift
Covariate shift is the case where the marginal distribution of inputs P(X) changes between training and test, but the conditional P(Y|X) stays the same. The term was introduced by Hidetoshi Shimodaira in his 2000 paper "Improving predictive inference under covariate shift by weighting the log-likelihood function" (Journal of Statistical Planning and Inference). Shimodaira showed that under model misspecification, weighting each training example by the importance ratio w(x) = p_test(x) / p_train(x) gives an asymptotically optimal estimator with respect to the test distribution. Under a correctly specified model, ordinary unweighted maximum likelihood is already asymptotically optimal, and importance weighting mainly adds variance to the estimator.
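The effect of the importance ratio can be illustrated on a toy problem where both densities are known exactly. The sketch below is an assumption-laden example, not Shimodaira's weighted likelihood procedure itself: training inputs come from N(0, 1), test inputs from N(0.5, 1), and the quantity of interest is the test-time average of a hypothetical per-example loss.

```python
import math
import random

def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def importance_weight(x):
    """w(x) = p_test(x) / p_train(x), known exactly in this toy setup."""
    return normal_pdf(x, 0.5) / normal_pdf(x, 0.0)

def loss(x):
    """Hypothetical per-example loss whose test-time average we want."""
    return x * x

rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

# Plain average targets the training distribution; the weighted average
# targets the test distribution using only training samples.
unweighted = sum(loss(x) for x in train_x) / len(train_x)
weighted = sum(importance_weight(x) * loss(x) for x in train_x) / len(train_x)

print(f"unweighted estimate: {unweighted:.3f}  (train expectation is 1.00)")
print(f"weighted estimate:   {weighted:.3f}  (test expectation is 1.25)")
```

In practice the density ratio is unknown and has to be estimated, and large weights can make the weighted average much noisier than the unweighted one.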
Label shift
Label shift, also called prior probability shift, is the case where P(Y) changes but P(X|Y) stays the same. This is natural for diagnostic problems: the class-conditional appearance of a disease (P(X|Y)) is determined by biology and barely changes, while the prevalence P(Y) varies between hospitals, populations, and time periods. The 2018 paper "Detecting and correcting for label shift with black box predictors" by Zachary Lipton, Yu-Xiang Wang, and Alex Smola (ICML 2018) introduced Black Box Shift Estimation (BBSE), which estimates the test-time label distribution using only a black-box predictor and the predictor's confusion matrix on a held-out validation set. BBSE is consistent as long as the confusion matrix is invertible.
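The linear system at the heart of BBSE can be worked through by hand for two classes. The sketch below uses population-level quantities with made-up numbers (in practice both the confusion matrix and the target prediction frequencies are estimated from finite samples), and the 2x2 solver is an illustrative helper.

```python
from fractions import Fraction as F

# Source-domain joint confusion matrix C[i][j] = P_source(predicted = i, true = j).
# Illustrative numbers: balanced classes with 90% and 80% per-class accuracy.
C = [[F(45, 100), F(10, 100)],
     [F(5, 100), F(40, 100)]]
source_prior = [C[0][j] + C[1][j] for j in range(2)]  # P_source(y = j)

# Distribution of the predictor's outputs on unlabeled target data (illustrative).
mu = [F(34, 100), F(66, 100)]

def solve_2x2(A, b):
    """Solve A w = b by Cramer's rule; requires an invertible A."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("confusion matrix is singular")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Under label shift, C w = mu with w[j] = P_target(y = j) / P_source(y = j).
weights = solve_2x2(C, mu)
target_prior = [w * p for w, p in zip(weights, source_prior)]

print(weights)       # [Fraction(2, 5), Fraction(8, 5)]
print(target_prior)  # [Fraction(1, 5), Fraction(4, 5)]
```

The recovered weights can then be used to reweight the training loss or to adjust the predictor's outputs for the new class balance.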
Concept drift
Concept drift refers to changes in P(Y|X) over time. In fraud detection, the patterns associated with fraudulent transactions evolve as criminals adapt their strategies. In customer-churn prediction, the relationship between features and churn changes as products and competitors evolve. The 2014 survey "A survey on concept drift adaptation" by João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia (ACM Computing Surveys, Volume 46, Article 44) categorizes drift detectors, adaptive learning strategies, and evaluation methodologies. Drift can be sudden (abrupt change), gradual (slow transition), incremental (small steps), or recurring (cyclic patterns).
Domain shift and domain adaptation
Domain shift is a catch-all term for any meaningful difference between a source domain (where labeled training data is plentiful) and a target domain (where the model will be deployed, and where labels may be scarce or absent). Domain adaptation methods attempt to bridge this gap. Unsupervised domain adaptation assumes only unlabeled examples from the target domain are available. Semi-supervised domain adaptation assumes a small labeled target set.
Detecting and handling distribution shift
Researchers and practitioners have developed many techniques for detecting distribution shift and correcting for it.
Detection
Statistical tests including the Kolmogorov-Smirnov test, the Population Stability Index (PSI), and Maximum Mean Discrepancy (MMD) compare empirical distributions to flag shift. Two-sample classifier tests train a classifier to distinguish source from target samples; if it succeeds significantly above chance, the distributions differ. Sequential change-point detection methods including CUSUM, Page-Hinkley, and ADWIN flag drift in streaming data.
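As one concrete instance, the two-sample Kolmogorov-Smirnov statistic is the largest gap between two empirical CDFs. The sketch below is a plain-Python illustration with simulated one-dimensional features; the sample sizes, the 0.5 mean shift, and the large-sample 5 percent critical constant 1.358 are assumptions of the example, and a statistics library would normally be used instead.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_threshold(n, m, c_alpha=1.358):
    """Approximate large-sample critical value at the 5% level."""
    return c_alpha * math.sqrt((n + m) / (n * m))

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1_000)]  # stand-in training feature
shifted = [rng.gauss(0.5, 1.0) for _ in range(1_000)]    # stand-in production feature

d = ks_statistic(reference, shifted)
shift_detected = d > ks_threshold(len(reference), len(shifted))
print(f"KS statistic: {d:.3f}, shift detected: {shift_detected}")
```

The KS test works on one feature at a time; for high-dimensional inputs, MMD or a two-sample classifier test is usually a better fit.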
Importance weighting
Under covariate shift, the Shimodaira (2000) importance-weighting estimator reweights each training example by w(x) = p_test(x) / p_train(x). Estimating these density ratios directly without estimating each density separately has been a substantial research focus; methods include Kullback-Leibler Importance Estimation Procedure (KLIEP, Sugiyama and colleagues 2007) and least-squares importance fitting. Importance-weighted cross-validation extends standard cross-validation to the covariate-shift setting (Sugiyama, Krauledat, and Müller, JMLR 2007).
Domain adaptation methods
Domain adaptation methods learn representations that align source and target distributions in some feature space.
| Method | Citation | Approach |
|---|---|---|
| TCA (Transfer Component Analysis) | Pan and colleagues 2011 | Project to a shared subspace where MMD is minimized |
| CORAL | Sun, Feng, and Saenko 2016 | Align second-order statistics (covariances) of source and target features |
| Deep CORAL | Sun and Saenko 2016, ECCV workshop | Same idea applied to deep features end-to-end |
| DANN | Ganin, Lempitsky, and colleagues 2016, JMLR | Adversarial training: features fool a domain classifier |
| ADDA | Tzeng and colleagues 2017, CVPR | Adversarial discriminative domain adaptation |
| MMD-based methods | Long and colleagues 2015, ICML | Minimize Maximum Mean Discrepancy across feature layers |
| CDAN | Long and colleagues 2018, NeurIPS | Conditional adversarial domain adaptation |
Test-time adaptation
Test-time adaptation methods update the model online during deployment, without access to labels. TENT (Test ENTropy minimization), introduced by Dequan Wang, Evan Shelhamer, and colleagues at ICLR 2021, adapts a pre-trained model at test time by minimizing the entropy of its predictions on each batch, updating only the channel-wise affine parameters of batch-norm layers. TENT reduces error on corrupted ImageNet and CIFAR-10/100 and reached state-of-the-art results on ImageNet-C at the time of publication. Other test-time methods include MEMO (Marginal Entropy Minimization with One test point) and Test-Time Training.
Continual learning
Continual learning addresses the challenge of learning from a non-stationary stream of data without forgetting previously learned knowledge (catastrophic forgetting). Methods include Elastic Weight Consolidation, replay buffers, gradient projection, and meta-learning for fast adaptation.
Out-of-distribution detection
Out-of-distribution (OOD) detection is the task of flagging inputs that come from a distribution different from the training distribution. The goal is for a deployed model to know when it does not know. This area exploded after 2017 and is now a standard part of safety-conscious machine learning systems.
Maximum softmax probability baseline
The 2017 paper "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks" by Dan Hendrycks and Kevin Gimpel (ICLR 2017) established the simplest baseline: take the maximum softmax probability of the prediction as a confidence score, and threshold on it. Correctly classified in-distribution examples tend to have higher max softmax probability than incorrectly classified examples or OOD examples. The baseline is surprisingly hard to beat and remains a reference point for almost every later method.
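The baseline score takes only a few lines to compute from a classifier's logits. The sketch below is illustrative: the logit vectors and the 0.5 threshold are made-up values, and in practice the threshold is tuned on held-out data.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher means more confident."""
    return max(softmax(logits))

def is_flagged(logits, threshold=0.5):
    """Flag an input as possibly misclassified or OOD when confidence is low."""
    return msp_score(logits) < threshold

confident = [10.0, 0.0, 0.0]  # one class dominates
uncertain = [0.2, 0.1, 0.0]   # nearly flat logits

print(f"{msp_score(confident):.4f}", is_flagged(confident))
print(f"{msp_score(uncertain):.4f}", is_flagged(uncertain))
```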
ODIN
ODIN ("Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks") was introduced by Shiyu Liang, Yixuan Li, and R. Srikant at ICLR 2018. ODIN combines two ideas: temperature scaling of the softmax (with large T such as 1000) and adding a small gradient-based perturbation to the input, in the direction that increases the model's confidence in the predicted class (the opposite of an adversarial attack). Both modifications tend to sharpen the separation between in-distribution and OOD scores. On DenseNet trained on CIFAR-10 with TinyImageNet as OOD, ODIN reduced the false positive rate at 95 percent true positive rate from 34.7 percent to 4.3 percent.
Mahalanobis and energy-based scores
Lee and colleagues (NeurIPS 2018) proposed using the Mahalanobis distance in the feature space of a trained network as an OOD score. Weitang Liu and colleagues introduced energy-based OOD detection at NeurIPS 2020, scoring inputs by their free energy, the negative log-sum-exp of the logits, instead of the softmax confidence. The energy score is theoretically aligned with the input density under the energy-based interpretation of a classifier, and on a CIFAR-10 WideResNet it reduced the average FPR at 95 percent TPR by 18.03 percent compared with the softmax confidence baseline.
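The energy score itself is a one-line function of the logits. The sketch below assumes the convention E(x) = -T * logsumexp(logits / T) with T = 1, so that lower energy indicates a more in-distribution input; the logit vectors are made-up values chosen to show a case where the softmax confidence cannot tell two inputs apart but the energy can.

```python
import math

def logsumexp(values):
    """Numerically stable log of the sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def energy(logits, temperature=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T); lower is more in-distribution."""
    return -temperature * logsumexp([z / temperature for z in logits])

def max_softmax(logits):
    """Maximum softmax probability, for comparison with the energy score."""
    return math.exp(max(logits) - logsumexp(logits))

strong = [5.0, 5.0]  # large logits split between two classes
weak = [0.0, 0.0]    # small logits, same softmax output

print(max_softmax(strong), max_softmax(weak))          # both 0.5
print(f"{energy(strong):.3f}", f"{energy(weak):.3f}")  # -5.693 -0.693
```

Because the softmax is invariant to adding a constant to every logit, it discards the overall logit magnitude that the energy score retains.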
Deep SVDD and one-class methods
Deep SVDD (Deep Support Vector Data Description), introduced by Lukas Ruff and colleagues at ICML 2018, learns a neural feature map under which normal data is concentrated inside a minimum-volume hypersphere; OOD points fall outside the sphere. Deep SVDD is purely one-class: it does not require any OOD examples at training time. Related anomaly-detection approaches include normalizing flows, autoencoders, and generative adversarial networks for density estimation.
OOD benchmarks
Common OOD benchmarks include the following.
| Benchmark | Source | OOD set | Notes |
|---|---|---|---|
| CIFAR-10 vs SVHN | CIFAR-10 | SVHN | Classic image OOD pair |
| CIFAR-10 vs CIFAR-100 | CIFAR-10 | CIFAR-100 | Near-OOD with overlapping semantics |
| ImageNet vs ImageNet-O | ImageNet-1k | ImageNet-O | Adversarially filtered OOD images |
| ImageNet-C | ImageNet-1k | 15 corruption types at 5 severity levels (75 variants) | Hendrycks and Dietterich, ICLR 2019 |
| OpenOOD | Various | Various | Unified codebase and leaderboards (Yang and colleagues 2022) |
ImageNet-C
ImageNet-C is a corruption-robustness benchmark introduced by Dan Hendrycks and Thomas Dietterich at ICLR 2019. It applies 15 common visual corruptions (Gaussian noise, blur, weather effects, JPEG compression, and others) at five severity levels each to the ImageNet validation set, producing 75 corrupted variants. ImageNet-C has become a standard testbed for robustness to common image corruptions.
OOD generalization
OOD generalization is the goal of training a model that performs well on test distributions different from the training distribution. The 2010s saw a wave of methods based on invariance principles, including domain-adversarial training, invariant risk minimization (IRM, Arjovsky and colleagues 2019), and many others. The next question was whether these methods actually beat plain empirical risk minimization (ERM) when evaluated rigorously.
DomainBed
The 2020 paper "In Search of Lost Domain Generalization" by Ishaan Gulrajani and David Lopez-Paz (published at ICLR 2021) introduced DomainBed, a testbed containing seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. The central finding was provocative: when carefully implemented and given fair model selection, empirical risk minimization shows state-of-the-art performance across all datasets. No algorithm in DomainBed outperforms ERM by more than one point under matched experimental conditions. The paper argued that any domain generalization algorithm without a principled model selection strategy should be regarded as incomplete.
WILDS benchmark
The 2021 paper "WILDS: A Benchmark of in-the-Wild Distribution Shifts" by Pang Wei Koh, Shiori Sagawa, Henrik Marklund, and many co-authors (ICML 2021) curated 10 datasets reflecting realistic distribution shifts: shifts across hospitals for tumor identification, across camera traps for wildlife monitoring, across time and location for satellite imaging and poverty mapping, across users for product reviews, and others. On every WILDS dataset, standard ERM shows a substantial gap between in-distribution and out-of-distribution performance, and this gap is not closed by existing distribution-shift methods. The benchmark is available at wilds.stanford.edu with an open-source package, default architectures, and leaderboards.
Fine-grained analysis of distribution shift
Olivia Wiles and colleagues at DeepMind published "A Fine-Grained Analysis on Distribution Shift" at ICLR 2022. They evaluated 19 distribution-shift methods across five categories on both synthetic and real-world datasets, training more than 85,000 models in total. The conclusions echo DomainBed: no single method dominates across the synthetic and real benchmarks, and good ERM training plus heavy augmentation and pre-training is often the strongest baseline.
Honest perspective on OOD generalization
The DomainBed and WILDS literature has produced a slightly humbling consensus: many of the gains reported by domain-generalization methods on small benchmarks disappear under fair evaluation. The methods that genuinely seem to help are pre-training on large external data, strong data augmentation, and ensembling. There is still room for principled methods, but the bar for a new technique is now empirical robustness across DomainBed-style suites, not point estimates on a single dataset.
Federated and decentralized learning
Federated learning trains models across many decentralized devices (such as smartphones) without centralizing the data. The term and the first practical algorithm were introduced by Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas in the 2017 AISTATS paper "Communication-Efficient Learning of Deep Networks from Decentralized Data." Their algorithm, FedAvg (Federated Averaging), repeatedly performs local SGD on each device and then averages model parameters at the server.
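The FedAvg loop can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: a one-parameter linear model, noise-free toy data, every client participating in every round (the original algorithm samples a fraction of clients), and hypothetical function names.

```python
def local_sgd(w, data, lr, epochs):
    """Plain SGD on squared error for the 1-D model y = w * x, one example at a time."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w

def fedavg(clients, rounds=20, lr=0.05, local_epochs=1, w0=0.0):
    """Each round: clients train locally, then the server averages their weights."""
    w_global = w0
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_weights = [local_sgd(w_global, data, lr, local_epochs) for data in clients]
        # Weight each client's model by its share of the training examples.
        w_global = sum(len(data) / total * w for data, w in zip(clients, local_weights))
    return w_global

# Two hypothetical clients whose inputs cover different ranges; both follow y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
print(fedavg(clients))
```

Only model parameters travel between clients and server; the raw `(x, y)` pairs never leave the client lists.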
Non-i.i.d. data across clients
Each device holds a local dataset that reflects its user's behavior, so data across devices is typically non-i.i.d. One device might contain mostly photos of food while another contains mostly photos of pets; one user types in English while another types in Hindi. The result is that local gradients pull the model in different directions, slowing convergence and degrading the final model. McMahan and colleagues showed FedAvg is robust to mild non-i.i.d. heterogeneity, but later work documented sharper degradation in highly heterogeneous settings.
Methods for non-i.i.d. federated optimization
| Method | Citation | Idea |
|---|---|---|
| FedAvg | McMahan and colleagues 2017 AISTATS | Local SGD followed by parameter averaging |
| FedProx | Li and colleagues 2020 MLSys | Adds a proximal term that limits how far local updates can drift from the global model |
| SCAFFOLD | Karimireddy and colleagues 2020 ICML | Uses control variates to correct for client drift |
| FedNova | Wang and colleagues 2020 NeurIPS | Normalizes local updates by number of local steps |
| pFedMe | T. Dinh and colleagues 2020 NeurIPS | Personalized federated learning with Moreau envelopes |
| Ditto | Li and colleagues 2021 ICML | Joint global and personalized objectives |
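To make the FedProx row concrete, the sketch below adds a proximal penalty (mu / 2) * (w - w_global)^2 to a client's local objective and compares how far the local model drifts with and without it. The one-parameter model, the data point, and the step sizes are illustrative assumptions, not values from the paper.

```python
def local_update(w_global, data, lr=0.1, steps=10, mu=0.0):
    """Full-batch gradient steps on squared error plus a FedProx-style proximal term."""
    w = w_global
    for _ in range(steps):
        grad_loss = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        grad_prox = mu * (w - w_global)  # pulls w back toward the global model
        w -= lr * (grad_loss + grad_prox)
    return w

# Hypothetical client whose local optimum (w = 5) is far from the global model (w = 0).
data = [(1.0, 5.0)]
drift_plain = abs(local_update(0.0, data, mu=0.0) - 0.0)
drift_prox = abs(local_update(0.0, data, mu=1.0) - 0.0)
print(drift_plain, drift_prox)
```

With `mu=0.0` the update reduces to ordinary local training as in FedAvg; larger `mu` keeps heterogeneous clients closer to the shared model at the cost of less local fitting.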
Personalization and privacy
In the personalization line of work, each client trains a model that is partially personalized to its local distribution, which is a direct response to non-i.i.d. data. The 2021 survey "Advances and Open Problems in Federated Learning" by Peter Kairouz, Brendan McMahan, and dozens of co-authors (Foundations and Trends in Machine Learning, Volume 14, pages 1 to 210) covers the broader landscape: privacy via differential privacy and secure aggregation, robustness to malicious clients, communication efficiency, and the open problems around non-i.i.d. data.
Causal inference and i.i.d.
Causal inference uses statistical data to estimate causal effects rather than mere associations. The standard i.i.d. assumption is rarely sufficient by itself. Causal inference adds further assumptions, including the stable unit treatment value assumption (SUTVA), unconfoundedness or ignorability, and positivity (overlap).
Judea Pearl's structural causal models (SCMs) treat each observation as generated by the same causal mechanism, which gives a different version of "identically distributed": every unit shares the same data-generating process. Counterfactual independence requires assumptions beyond i.i.d., since counterfactual outcomes are never observed jointly. Invariant Causal Prediction (Peters, Buhlmann, and Meinshausen 2016) and Invariant Risk Minimization (Arjovsky, Bottou, Gulrajani, and Lopez-Paz 2019) try to use multi-environment training data to learn predictors that depend only on causally relevant features, with the hope that such predictors will generalize OOD.
A recurring theme in this literature is that i.i.d. training data is fundamentally insufficient for identifying causal effects. Either domain knowledge or interventional data is necessary.
Practical splits
The way data is split into training and test sets implicitly relies on the i.i.d. assumption. When data is truly i.i.d., a random split ensures that both the training set and the test set are representative samples from the same distribution. When data is not i.i.d., random splitting can be misleading and the appropriate strategy depends on the type of dependence.
Random splits
A uniform random split assigns each example independently to train, validation, or test sets with fixed probabilities. This is the default in most machine learning pipelines and is appropriate only when the data is i.i.d. or close to it.
Stratified splits
Stratified splitting preserves the class balance (or any other stratifying variable) across train and test sets. The split is still i.i.d. conditional on the stratum, which is often preferable when class imbalance is severe.
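A stratified split can be sketched with the standard library alone. The function below is a hypothetical helper written for illustration (it is not a library API): it shuffles indices within each class and takes the same fraction of every class for the test set.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within the stratum
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Imbalanced toy labels: 16 negatives and 4 positives.
labels = ["neg"] * 16 + ["pos"] * 4
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these labels the test set always receives exactly one positive example, whereas a plain random split could easily receive none.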
Time-based splits
For time series data, the standard approach is a temporal split: train on earlier data and test on later data, preserving the natural order. Variations include rolling-window evaluation, expanding-window evaluation, and forward chaining cross-validation. A time-based split prevents future information from leaking into the training set.
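Forward chaining with an expanding window can be expressed as a small generator. This is a minimal sketch with a hypothetical function name and signature; it assumes the samples are already sorted by time.

```python
def forward_chaining_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        train = list(range(test_start))  # everything before the test window
        test = list(range(test_start, test_start + test_size))
        yield train, test

for train_idx, test_idx in forward_chaining_splits(n_samples=10, n_folds=3, test_size=2):
    print(f"train up to index {train_idx[-1]}, test on {test_idx}")
```

A rolling-window variant would instead drop the oldest training indices so that the training window keeps a fixed length.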
Group-based splits
For grouped or clustered data, group-aware splitting ensures that all observations from one group appear in either the training set or the test set, but not both. Examples include splitting patients across hospitals, users across mobile devices, or images by photographer or camera. The scikit-learn class GroupKFold implements this idea for k-fold cross-validation.
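The core idea of group-aware cross-validation fits in a short standard-library sketch. The function below is a hand-rolled illustration, not scikit-learn's GroupKFold implementation: it assigns whole groups to folds round-robin and does not try to balance fold sizes. The patient identifiers are made up.

```python
def group_kfold(groups, n_splits):
    """Yield (train_indices, test_indices) so that no group spans both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("cannot have more folds than groups")
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test

# One entry per observation: the patient each row belongs to.
groups = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
for train_idx, test_idx in group_kfold(groups, n_splits=2):
    print(sorted({groups[i] for i in test_idx}))
```

Every row for a given patient lands on the same side of each split, so a model cannot score well merely by recognizing a patient it has already seen.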
Spatial blocking
For spatial data, spatial blocking strategies assign entire geographic regions to either training or testing. Approaches include checkerboard blocking, k-fold blocking, and buffered leave-one-out cross-validation, which leaves out an entire region plus a buffer around it.
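Checkerboard blocking can be sketched by mapping each coordinate pair to a grid cell and alternating cells between two folds. The helper name, the block size, and the points below are illustrative assumptions, and the sketch omits the buffer zones used by buffered leave-one-out schemes.

```python
import math

def checkerboard_fold(x, y, block_size):
    """Return 0 or 1 depending on which checkerboard color the point's block has."""
    col = math.floor(x / block_size)
    row = math.floor(y / block_size)
    return (row + col) % 2

# Hypothetical sample locations in arbitrary map units.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (1.5, 1.5), (2.2, 0.1)]
folds = [checkerboard_fold(x, y, block_size=1.0) for x, y in points]
print(folds)
```

Nearby points that share a block always share a fold, so the choice of `block_size` should be guided by the range of spatial correlation in the data.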
Data leakage
Data leakage, where information from the test set contaminates the training process, is often a consequence of ignoring i.i.d. violations during splitting. Common forms of leakage include: preprocessing the entire dataset (including the test set) before splitting; using future information in time series features; splitting at the row level when groups should be respected; and target leakage from features that are constructed using the label. Preprocessing steps such as normalization or feature engineering must be fit only on the training data to prevent leakage.
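The rule about fitting preprocessing on the training data can be illustrated with a hand-written standardizer. The helper names and numbers below are hypothetical; the point is only that the mean and standard deviation are estimated from the training split and then reused unchanged on the test split.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Estimate standardization statistics from the given values."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    """Standardize values with previously fitted statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Correct: statistics come from the training split only.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

# Leaky: statistics computed on train and test together.
leaky_scaler = fit_scaler(train + test)
print(scaler, leaky_scaler)
```

The leaky scaler's mean is pulled toward the test values, which means information about the test set has already influenced every training feature.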
Examples of i.i.d. data
The following table illustrates common scenarios that produce i.i.d. data and contrasts them with non-i.i.d. counterparts.
| Scenario | i.i.d.? | Explanation |
|---|---|---|
| Fair coin tosses | Yes | Each toss is independent with a constant probability of 0.5 for heads |
| Fair dice rolls | Yes | Each roll is independent with identical probability (1/6) per face |
| Drawing cards with replacement | Yes | Returning the card before the next draw keeps probabilities constant |
| Roulette wheel spins | Yes | Each spin is independent and the probability of each outcome is fixed |
| Height measurements from random sampling | Yes | Each person is measured independently, drawn from the same population |
| Drawing cards without replacement | No | Removing a card changes the conditional probabilities for later draws (not independent, although each draw has the same marginal distribution) |
| Daily stock prices | No | Each price depends on the previous price (not independent) |
| Temperature readings over a week | No | Adjacent readings are correlated and may follow a trend (not independent, possibly not identically distributed) |
| Survey responses from the same household | No | Responses within a household may be correlated (not independent) |
| Sequential frames of video | No | Adjacent frames are nearly identical and highly correlated |
| Patients in a hospital cohort | No | Patients share doctors, equipment, and protocols |
| Tweets during a breaking news event | No | Topics shift sharply and tweets reference each other |
Consequences of violating i.i.d.
Ignoring i.i.d. violations can cause serious problems in practice.
| Consequence | Description |
|---|---|
| Biased parameter estimates | Correlated observations reduce the effective sample size, so estimates are less precise than the nominal sample size suggests; naive variance estimates are biased, and non-identical sampling can bias means and model coefficients |
| Invalid confidence intervals | Standard error formulas assume independence; with positively correlated data, confidence intervals are too narrow and p-values are too small |
| Inflated performance metrics | Random train/test splits on non-i.i.d. data allow information leakage, producing overly optimistic accuracy estimates |
| Poor generalization | A model trained on data from one distribution may fail on data from a shifted distribution |
| Unstable training | Non-i.i.d. mini-batches in stochastic gradient descent can cause erratic gradient updates and slow convergence |
| Overfitting to spurious patterns | The model may learn correlations that are artifacts of the data structure rather than genuine patterns |
| Calibration failures | Predicted probabilities may be miscalibrated on shifted test distributions, with consequences in high-stakes applications |
| Fairness regressions | Subgroup performance can degrade silently when subpopulations shift between training and deployment |
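The loss of information from correlated observations can be quantified in the special case of equal-size clusters. The sketch below uses the standard design-effect formula 1 + (m - 1) * rho, where m is the cluster size and rho is the intra-cluster correlation; this formula and the numbers are assumptions brought in for illustration and do not cover other dependence structures such as time series.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation of the sample mean for equal-size clusters."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations carrying the same information."""
    return n / design_effect(cluster_size, icc)

# Hypothetical survey: 1000 responses in clusters of 20 with intra-cluster correlation 0.1.
n, m, rho = 1000, 20, 0.1
deff = design_effect(m, rho)
print(f"design effect: {deff:.2f}")
print(f"effective sample size: {effective_sample_size(n, m, rho):.0f}")
print(f"naive standard errors too small by a factor of {math.sqrt(deff):.2f}")
```

Even a modest intra-cluster correlation can shrink the effective sample size to a fraction of the nominal one, which is why naive confidence intervals come out too narrow.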
Sampling methods that produce i.i.d. data
Producing genuinely i.i.d. data requires careful attention to sampling design.
| Sampling Method | Produces i.i.d. Data? | Notes |
|---|---|---|
| Simple random sampling (with replacement) | Yes | Each draw is independent and from the same population |
| Simple random sampling (without replacement) | Approximately, for large populations | Draws become weakly dependent, but the effect is negligible when the population is much larger than the sample |
| Stratified random sampling | Conditionally | Within each stratum, samples can be i.i.d.; across strata, the combined sample is not strictly i.i.d. |
| Cluster sampling | No | Observations within clusters are correlated |
| Convenience sampling | No | Selection bias means observations are not identically distributed |
| Systematic sampling | No | Fixed intervals introduce dependence between selected units |
| Reservoir sampling | Yes for the kept sample | Maintains a uniform random sample from a stream of unknown length |
In machine learning practice, data is often collected opportunistically rather than through formal sampling designs. Web scraping, user logs, and sensor streams rarely produce perfectly i.i.d. data. Recognizing these limitations is important for interpreting model performance honestly.
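For streams such as logs or sensor feeds, the reservoir sampling entry in the table above can be implemented in a few lines. This is a minimal sketch of the classic single-pass scheme (often called Algorithm R), with a hypothetical function name; it keeps a uniform random subset of size k without knowing the stream length in advance.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item  # item i is kept with probability k / (i + 1)
    return reservoir

print(reservoir_sample(range(1000), k=10, seed=0))
```

The kept items are a sample without replacement from the stream, so whether they can be treated as approximately i.i.d. still depends on how the stream itself was generated.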
Explain like I'm 5 (ELI5)
Imagine you have a big jar of jellybeans. The jar has many different colors mixed together. You close your eyes and pick one jellybean at a time.
"Identically distributed" means that the jar stays the same every time you pick. You always put the jellybean back before picking the next one, so every pick has the same chances. If 30 percent of the jellybeans are red, then every single pick has a 30 percent chance of being red.
"Independent" means that the jellybean you picked last time has absolutely no effect on what you pick this time. Picking a red jellybean does not make it more or less likely that the next one will be red.
When both of these things are true at the same time, we say the picks are "independently and identically distributed," or i.i.d. for short.
Now imagine you do not put the jellybean back. After you take a red one out, there are fewer reds in the jar, so the chances for your next pick depend on what you already took. That is not i.i.d. anymore because your first pick affected what was left (not independent).
In machine learning, we usually want our data to be like the first scenario (putting jellybeans back). We want every piece of data to come from the same source and not be affected by other pieces of data. When that is true, our computer programs can learn patterns more reliably.
References
- Casella, G., and Berger, R. L. (2002). *Statistical Inference* (2nd ed.). Duxbury Press.
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.
- Shalev-Shwartz, S., and Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press.
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press.
- Wasserman, L. (2004). *All of Statistics: A Concise Course in Statistical Inference*. Springer.
- Schervish, M. J. (1995). *Theory of Statistics*. Springer.
- Kolmogorov, A. N. (1933). *Grundbegriffe der Wahrscheinlichkeitsrechnung*. Springer-Verlag. (English translation: *Foundations of the Theory of Probability*, 1950.)
- Bernoulli, J. (1713). *Ars Conjectandi*. Basel.
- De Finetti, B. (1937). "La prévision: ses lois logiques, ses sources subjectives." *Annales de l'Institut Henri Poincaré*, 7(1), 1-68.
- Valiant, L. G. (1984). "A theory of the learnable." *Communications of the ACM*, 27(11), 1134-1142.
- Vapnik, V. N., and Chervonenkis, A. Ya. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280.
- Bartlett, P. L., and Mendelson, S. (2002). "Rademacher and Gaussian complexities: Risk bounds and structural results." *Journal of Machine Learning Research*, 3, 463-482.
- Shimodaira, H. (2000). "Improving predictive inference under covariate shift by weighting the log-likelihood function." *Journal of Statistical Planning and Inference*, 90(2), 227-244.
- Sugiyama, M., Krauledat, M., and Muller, K.-R. (2007). "Covariate shift adaptation by importance weighted cross validation." *Journal of Machine Learning Research*, 8, 985-1005.
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.) (2009). *Dataset Shift in Machine Learning*. MIT Press.
- Moreno-Torres, J. G., Raeder, T., Alaiz-Rodriguez, R., Chawla, N. V., and Herrera, F. (2012). "A unifying view on dataset shift in classification." *Pattern Recognition*, 45(1), 521-530.
- Lipton, Z. C., Wang, Y.-X., and Smola, A. (2018). "Detecting and correcting for label shift with black box predictors." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, PMLR 80, 3122-3130.
- Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). "A survey on concept drift adaptation." *ACM Computing Surveys*, 46(4), Article 44.
- Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). "Domain-adversarial training of neural networks." *Journal of Machine Learning Research*, 17(59), 1-35.
- Sun, B., and Saenko, K. (2016). "Deep CORAL: Correlation alignment for deep domain adaptation." *Proceedings of the European Conference on Computer Vision (ECCV) Workshops*.
- Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. (2021). "TENT: Fully test-time adaptation by entropy minimization." *International Conference on Learning Representations (ICLR)*.
- Hendrycks, D., and Gimpel, K. (2017). "A baseline for detecting misclassified and out-of-distribution examples in neural networks." *International Conference on Learning Representations (ICLR)*.
- Liang, S., Li, Y., and Srikant, R. (2018). "Enhancing the reliability of out-of-distribution image detection in neural networks." *International Conference on Learning Representations (ICLR)*.
- Liu, W., Wang, X., Owens, J. D., and Li, Y. (2020). "Energy-based out-of-distribution detection." *Advances in Neural Information Processing Systems (NeurIPS) 33*.
- Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Muller, E., and Kloft, M. (2018). "Deep one-class classification." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
- Hendrycks, D., and Dietterich, T. (2019). "Benchmarking neural network robustness to common corruptions and perturbations." *International Conference on Learning Representations (ICLR)*.
- Gulrajani, I., and Lopez-Paz, D. (2021). "In search of lost domain generalization." *International Conference on Learning Representations (ICLR)*.
- Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B. A., Haque, I. S., Beery, S., Leskovec, J., Kundaje, A., Pierson, E., Koyejo, S., Schmidt, L., and Liang, P. (2021). "WILDS: A benchmark of in-the-wild distribution shifts." *Proceedings of the 38th International Conference on Machine Learning (ICML)*.
- Wiles, O., Gowal, S., Stimberg, F., Rebuffi, S.-A., Ktena, I., Dvijotham, K., and Cemgil, T. (2022). "A fine-grained analysis on distribution shift." *International Conference on Learning Representations (ICLR)*.
- McMahan, B., Moore, E., Ramage, D., Hampson, S., and Aguera y Arcas, B. (2017). "Communication-efficient learning of deep networks from decentralized data." *Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, PMLR 54, 1273-1282.
- Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2020). "Federated optimization in heterogeneous networks." *Proceedings of the 3rd MLSys Conference*.
- Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. (2020). "SCAFFOLD: Stochastic controlled averaging for federated learning." *Proceedings of the 37th International Conference on Machine Learning (ICML)*.
- Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., and many others (2021). "Advances and open problems in federated learning." *Foundations and Trends in Machine Learning*, 14(1-2), 1-210.
- Vovk, V., Gammerman, A., and Shafer, G. (2005). *Algorithmic Learning in a Random World*. Springer.
- Shafer, G., and Vovk, V. (2008). "A tutorial on conformal prediction." *Journal of Machine Learning Research*, 9, 371-421.
- Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). "Conformal prediction under covariate shift." *Advances in Neural Information Processing Systems (NeurIPS) 32*.
- Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2023). "Conformal prediction beyond exchangeability." *Annals of Statistics*, 51(2), 816-845.
- Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). "Invariant risk minimization." arXiv:1907.02893.
- Peters, J., Buhlmann, P., and Meinshausen, N. (2016). "Causal inference using invariant prediction: identification and confidence intervals." *Journal of the Royal Statistical Society Series B*, 78(5), 947-1012.