See also: Machine learning terms, Probability, Statistics
In probability theory and statistics, a collection of random variables is said to be independent and identically distributed (abbreviated i.i.d., iid, or IID) if each variable has the same probability distribution as the others and all variables are mutually independent. This concept is one of the most important foundational assumptions across machine learning, statistical inference, and data science. Nearly every classical algorithm in supervised and unsupervised learning relies, either explicitly or implicitly, on the assumption that the data points in a dataset are i.i.d. samples drawn from some underlying distribution.
The i.i.d. assumption simplifies the mathematics of learning and inference considerably. It allows the joint probability of an entire dataset to be expressed as the product of individual probabilities, which makes optimization tractable and enables powerful theoretical guarantees about generalization. However, many real-world datasets violate this assumption, and understanding when and how i.i.d. breaks down is essential for building reliable models.
The term "independently and identically distributed" combines two distinct mathematical properties:
A set of random variables X₁, X₂, ..., Xₙ is independent if the realization of any one variable provides no information about the others. Formally, two random variables X and Y are independent if their joint cumulative distribution function (CDF) equals the product of their individual CDFs:
F_{X,Y}(x, y) = F_X(x) * F_Y(y) for all x, y
Equivalently, in terms of probabilities:
P(X in A, Y in B) = P(X in A) * P(Y in B) for all events A and B
For a full collection of n variables, this factorization must hold for every possible subset, not just for pairs.
Random variables are identically distributed if they all share the same probability distribution. Formally, X₁ and X₂ are identically distributed if:
F_{X₁}(x) = F_{X₂}(x) for all x
This means that the mechanism generating each data point is the same. There are no trends, shifts, or changes in the distribution over time or across samples.
A sequence of random variables is i.i.d. if and only if both conditions hold simultaneously. Each observation must be drawn from the same distribution, and the draw of one observation must have no effect on any other. When both conditions are satisfied, the joint probability of the entire dataset decomposes into a simple product:
P(X₁, X₂, ..., Xₙ) = P(X₁) * P(X₂) * ... * P(Xₙ)
This factorization is the key property that makes statistical analysis and maximum likelihood estimation computationally feasible.
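As a concrete illustration, the following minimal Python sketch (using NumPy and SciPy, with an assumed normal model and illustrative variable names) shows how the i.i.d. factorization turns the joint log-likelihood into a simple sum over observations:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(data, mu, sigma):
    # Under the i.i.d. assumption the joint density factors into a product,
    # so the log-likelihood is just a sum of per-observation log densities.
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# For the normal model the maximum likelihood estimates have closed forms:
mu_hat = data.mean()
sigma_hat = data.std()  # MLE uses the 1/n variance estimator

print(log_likelihood(data, mu_hat, sigma_hat))
```

Without the factorization, evaluating the joint likelihood would require a model of the dependence structure among all n observations rather than a single sum over them.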
The following table illustrates common scenarios that produce i.i.d. data and contrasts them with non-i.i.d. counterparts:
| Scenario | i.i.d.? | Explanation |
|---|---|---|
| Fair coin tosses | Yes | Each toss is independent with a constant probability of 0.5 for heads |
| Fair dice rolls | Yes | Each roll is independent with identical probability (1/6) per face |
| Drawing cards with replacement | Yes | Returning the card before the next draw keeps probabilities constant |
| Roulette wheel spins | Yes | Each spin is independent and the probability of each outcome is fixed |
| Height measurements from random sampling | Yes | Each person is measured independently, drawn from the same population |
| Drawing cards without replacement | No | Removing a card changes the conditional probabilities of later draws (not independent) |
| Daily stock prices | No | Each price depends on the previous price (not independent) |
| Temperature readings over a week | No | Adjacent readings are correlated and may follow a trend (not independent, possibly not identically distributed) |
| Survey responses from the same household | No | Responses within a household may be correlated (not independent) |
The i.i.d. assumption is deeply embedded in the foundations of machine learning. Most standard algorithms and evaluation procedures assume that data points in the training set and test set are i.i.d. samples from the same underlying distribution. This assumption matters for several interconnected reasons.
When data is i.i.d., the likelihood function for the entire dataset factors into a product of individual likelihoods. This makes it possible to use gradient descent and other optimization methods to find model parameters that maximize the likelihood. Without this factorization, computing the joint probability of all observations would require modeling complex dependencies between every pair of data points.
Statistical learning theory provides bounds on how well a model trained on a finite sample will perform on unseen data. These generalization bounds, including the Probably Approximately Correct (PAC) learning framework and Vapnik-Chervonenkis (VC) theory, assume that training and test data are drawn i.i.d. from the same distribution. If this assumption holds, a model that performs well on the training data is likely to perform well on new data, provided the model is not too complex (avoiding overfitting).
Standard evaluation techniques such as cross-validation, holdout validation, and bootstrapping all assume i.i.d. data. In k-fold cross-validation, for example, the data is randomly partitioned into k subsets. The validity of this procedure depends on each data point being interchangeable, meaning no data point carries information about another. When the data is not i.i.d., random splitting can produce overly optimistic performance estimates because information can leak between folds.
The following table summarizes common machine learning algorithms and how they rely on the i.i.d. assumption:
| Algorithm | Role of i.i.d. Assumption |
|---|---|
| Linear regression | Assumes residuals are i.i.d.; violations produce biased standard errors |
| Logistic regression | Assumes observations are independent; correlated data inflates significance |
| Decision trees / Random forests | Splitting and bagging assume i.i.d. samples; temporal order is ignored |
| Neural networks | Stochastic gradient descent shuffles data randomly, assuming order does not matter |
| K-means clustering | Assumes data points are drawn independently from a mixture distribution |
| Naive Bayes | Assumes i.i.d. training samples, in addition to conditional independence of features given the class label |
| Support vector machines | Generalization bounds depend on i.i.d. training samples |
Many real-world datasets do not satisfy the i.i.d. assumption. Recognizing these violations is critical for selecting appropriate modeling strategies.
Time series data violates independence because observations at adjacent time steps are typically correlated (autocorrelation). Stock prices, weather measurements, sensor readings, and web traffic all exhibit temporal dependencies. Using standard cross-validation on time series data can produce misleadingly high accuracy because future information leaks into the training set.
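One simple diagnostic is to compute the lag-1 autocorrelation of a series; values far from zero signal a violation of independence. The following NumPy sketch (the AR(1) coefficient of 0.8 is an illustrative choice) demonstrates this on a synthetic autocorrelated series:

```python
import numpy as np

rng = np.random.default_rng(42)

# An AR(1) process: each value depends on the previous one, violating independence.
n = 1_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()

# Lag-1 autocorrelation: correlation between the series and itself shifted by one step.
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.2f}")  # close to 0.8, far from the ~0 expected under i.i.d.
```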
Geographic and spatial data often exhibits spatial autocorrelation, where nearby locations have more similar values than distant ones. For example, air quality measurements, housing prices, or soil characteristics at nearby locations tend to be correlated. Treating spatially correlated observations as independent can lead to underestimation of standard errors and inflated type I error rates.
When data points are connected through a social network, citation graph, or biological network, observations are not independent. A person's behavior on a social network is influenced by their connections. Standard i.i.d. methods applied to network data can yield misleading results because they ignore the relational structure.
In medical studies, students within the same school, or patients treated by the same doctor share latent characteristics. Observations within a group are more similar to each other than to observations from other groups, violating independence.
Even when observations are independent, they may not be identically distributed. Distribution shift occurs when the data distribution changes between training and deployment. This can happen due to changes in user behavior, market conditions, or sensor calibration over time.
Ignoring i.i.d. violations can cause serious problems in practice:
| Consequence | Description |
|---|---|
| Biased parameter estimates | Correlated observations effectively reduce the true sample size, so estimates of means, variances, and model coefficients may be biased |
| Invalid confidence intervals | Standard error formulas assume independence; with correlated data, confidence intervals are too narrow and p-values are too small |
| Inflated performance metrics | Random train/test splits on non-i.i.d. data allow information leakage, producing overly optimistic accuracy estimates |
| Poor generalization | A model trained on data from one distribution may fail on data from a shifted distribution |
| Unstable training | Non-i.i.d. mini-batches in stochastic gradient descent can cause erratic gradient updates and slow convergence |
| Overfitting to spurious patterns | The model may learn correlations that are artifacts of the data structure rather than genuine patterns |
The way data is split into training and test sets implicitly relies on the i.i.d. assumption. When data is truly i.i.d., a random split ensures that both the training set and the test set are representative samples from the same distribution. However, when data is not i.i.d., random splitting can be dangerously misleading.
For time series data, the standard approach is to use a temporal split: train on earlier data and test on later data, preserving the natural order. For grouped data, group-aware splitting ensures that all observations from one group appear in either the training set or the test set, but not both. For spatial data, spatial blocking strategies assign entire geographic regions to either training or testing.
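A minimal sketch of these strategies using scikit-learn's TimeSeriesSplit and GroupKFold (with toy data and illustrative group assignments) might look like this:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

X = np.arange(20).reshape(-1, 1)
groups = np.repeat(np.arange(5), 4)  # e.g. 5 households with 4 responses each

# Temporal split: every training index precedes every test index.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()

# Group-aware split: no group appears in both training and test sets.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

# A plain random KFold, by contrast, ignores both time order and group structure.
```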
Data leakage, where information from the test set contaminates the training process, is a direct consequence of ignoring i.i.d. violations during splitting. Preprocessing steps such as normalization or feature engineering must be fit only on the training data to prevent leakage.
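One common safeguard, sketched below with scikit-learn (toy data, illustrative model choice), is to wrap preprocessing in a pipeline so that scaling parameters are estimated from the training data only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrapping the scaler in a pipeline guarantees its mean and scale are fit
# on the training data only, never on the held-out data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```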
Several active research areas in machine learning specifically address scenarios where the i.i.d. assumption does not hold.
Federated learning trains models across many decentralized devices (such as smartphones) without centralizing the data. Each device holds a local dataset that reflects its user's behavior, so data across devices is typically non-i.i.d. One device might contain mostly photos of food while another contains mostly photos of pets. This heterogeneity causes standard aggregation methods (such as Federated Averaging) to converge slowly or to a suboptimal solution. Techniques such as FedProx and SCAFFOLD, which correct for the drift of local client updates, have been developed to handle non-i.i.d. data distributions in federated settings.
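A toy sketch of how label-skewed, non-i.i.d. client shards are often simulated in federated learning experiments is shown below (plain NumPy; the function name and parameters are illustrative, not part of any federated learning library):

```python
import numpy as np

def label_skewed_shards(labels, n_clients, labels_per_client, seed=0):
    """Toy non-i.i.d. partition: each client sees only a few label classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    shards = []
    for _ in range(n_clients):
        chosen = rng.choice(classes, size=labels_per_client, replace=False)
        idx = np.flatnonzero(np.isin(labels, chosen))
        shards.append(rng.permutation(idx))
    return shards

labels = np.random.default_rng(1).integers(0, 10, size=1_000)
shards = label_skewed_shards(labels, n_clients=5, labels_per_client=2)
for i, idx in enumerate(shards):
    print(f"client {i}: classes {sorted(set(labels[idx]))}")
```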
Distribution shift occurs when the test distribution differs from the training distribution. Covariate shift, label shift, and concept drift are all forms of distribution shift. Domain adaptation methods attempt to bridge the gap between source and target distributions. Transfer learning pretrains a model on one distribution and fine-tunes it on another, partially mitigating the effects of distribution shift.
Concept drift refers to changes in the statistical properties of the target variable over time. In fraud detection, for example, the patterns associated with fraudulent transactions evolve as criminals adapt their strategies. Models must be continuously updated or equipped with drift detection mechanisms to maintain accuracy.
Online learning algorithms process data sequentially and update the model after each observation. These algorithms are designed for non-i.i.d. settings where the data distribution may change over time. Continual learning addresses the challenge of learning from a non-stationary stream of data without forgetting previously learned knowledge (catastrophic forgetting).
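A minimal sketch of sequential updating, using scikit-learn's SGDClassifier and its partial_fit method on a synthetic, slowly drifting stream (the drift schedule is illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

# Process the stream one mini-batch at a time, updating after each batch.
for step in range(100):
    X_batch = rng.normal(size=(32, 4))
    # A slowly shifting decision boundary: the target concept changes over time.
    drift = step / 100.0
    y_batch = (X_batch[:, 0] + drift * X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)
```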
Two foundational theorems in probability theory rely on the i.i.d. assumption.
The law of large numbers states that if X₁, X₂, ..., Xₙ are i.i.d. random variables with finite mean μ, then the sample mean converges to μ as n grows large. This theorem justifies using sample averages to estimate population means and underpins the entire framework of statistical estimation.
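The convergence is easy to see in simulation. The following NumPy sketch (an exponential distribution with mean μ = 2 is chosen purely for illustration) tracks the running sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.exponential(scale=2.0, size=100_000)  # i.i.d. draws with mean μ = 2

# Running sample mean: by the law of large numbers it converges to μ.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)
for n in (10, 1_000, 100_000):
    print(f"n={n:>7}: sample mean = {running_mean[n - 1]:.4f}")
```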
The central limit theorem states that if X₁, X₂, ..., Xₙ are i.i.d. random variables with finite mean μ and finite variance σ², then the standardized sample mean converges in distribution to a standard normal distribution as n approaches infinity:
(X̄ - μ) / (σ / √n) → N(0, 1)
The CLT is the reason that many statistical methods assume normality. It explains why the sampling distribution of the mean is approximately normal regardless of the shape of the underlying distribution, provided the sample size is large enough. Extensions of the CLT, such as the Lyapunov CLT and the Lindeberg-Feller CLT, relax some of these conditions (for example, allowing variables that are independent but not identically distributed), but the classical Lindeberg-Lévy version requires full i.i.d.
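The CLT can likewise be checked by simulation. The NumPy sketch below (again using an exponential distribution, which has mean 2 and standard deviation 2 when scale=2) standardizes many sample means and verifies that they behave like standard normal draws:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 2.0, 500  # exponential(scale=2) has mean 2 and std 2

# Draw many i.i.d. samples, standardize each sample mean, and check normality.
samples = rng.exponential(scale=2.0, size=(10_000, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(z.mean(), z.std())          # ≈ 0 and ≈ 1
print(np.mean(np.abs(z) < 1.96))  # ≈ 0.95, as for a standard normal
```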
Producing genuinely i.i.d. data requires careful attention to sampling design:
| Sampling Method | Produces i.i.d. Data? | Notes |
|---|---|---|
| Simple random sampling (with replacement) | Yes | Each draw is independent and from the same population |
| Simple random sampling (without replacement) | Approximately, for large populations | Draws become weakly dependent, but the effect is negligible when the population is much larger than the sample |
| Stratified random sampling | Conditionally | Within each stratum, samples can be i.i.d.; across strata, the combined sample is not strictly i.i.d. |
| Cluster sampling | No | Observations within clusters are correlated |
| Convenience sampling | No | Selection bias means observations are not identically distributed |
| Systematic sampling | No | Fixed intervals introduce dependence between selected units |
In machine learning practice, data is often collected opportunistically rather than through formal sampling designs. Web scraping, user logs, and sensor streams rarely produce perfectly i.i.d. data. Recognizing these limitations is important for interpreting model performance honestly.
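The distinction between sampling with and without replacement is easy to express in code; a minimal NumPy sketch (toy population of 100 units) follows:

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)

with_repl = rng.choice(population, size=10, replace=True)      # i.i.d. draws
without_repl = rng.choice(population, size=10, replace=False)  # dependent draws

# Without replacement, the same unit can never appear twice, so knowing one
# draw tells you something about the others; with replacement it does not.
print(with_repl, without_repl, sep="\n")
```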
Exchangeability is a weaker condition than i.i.d. that plays an important role in Bayesian statistics.
A sequence of random variables is exchangeable if the joint distribution is invariant under any permutation of the indices. That is, the order of observations does not matter. Every i.i.d. sequence is exchangeable, but the converse is not true. For example, draws from a Polya urn model are exchangeable but not independent: each draw changes the composition of the urn, so future draws depend on past draws.
De Finetti's theorem provides a deep connection between exchangeability and i.i.d. The theorem states that any infinite sequence of exchangeable random variables can be represented as a mixture of i.i.d. sequences. More precisely, exchangeable variables are conditionally i.i.d. given some latent parameter. This result is foundational for Bayesian inference, where the unknown parameter is treated as a random variable with a prior distribution.
The practical implication is that exchangeability is often a more realistic assumption than strict i.i.d. in Bayesian modeling. It allows for dependence between observations while still enabling tractable inference through the mixture representation.
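The Polya urn example is easy to simulate. The following Python sketch (the starting composition and function name are illustrative) produces draws that are exchangeable but clearly dependent:

```python
import numpy as np

def polya_urn_draws(n_draws, red=1, blue=1, seed=0):
    """Polya urn: after each draw, return the ball plus one more of its color."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        is_red = rng.random() < red / (red + blue)
        draws.append(int(is_red))
        if is_red:
            red += 1
        else:
            blue += 1
    return draws

# Draws are exchangeable but not independent: an early red makes later reds likelier.
print(polya_urn_draws(20))
```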
Imagine you have a big jar of jellybeans. The jar has many different colors mixed together. You close your eyes and pick one jellybean at a time.
"Identically distributed" means that the jar stays the same every time you pick. You always put the jellybean back before picking the next one, so every pick has the same chances. If 30% of the jellybeans are red, then every single pick has a 30% chance of being red.
"Independent" means that the jellybean you picked last time has absolutely no effect on what you pick this time. Picking a red jellybean does not make it more or less likely that the next one will be red.
When both of these things are true at the same time, we say the picks are "independent and identically distributed," or i.i.d. for short.
Now imagine you do not put the jellybean back. After you take a red one out, there are fewer reds in the jar. Your next pick is slightly different from the first one. That is not i.i.d. anymore because the jar changed (not identically distributed) and your first pick affected what was left (not independent).
In machine learning, we usually want our data to be like the first scenario (putting jellybeans back). We want every piece of data to come from the same source and not be affected by other pieces of data. When that is true, our computer programs can learn patterns more reliably.