Machine learning algorithms often assume that data is independently and identically distributed (i.i.d.), meaning each data point is drawn independently from the same probability distribution. This assumption is essential for many machine learning algorithms because it justifies the statistical machinery used to generalize from patterns observed in the data to predictions on new data.
Formally speaking, a group of data points is independent if the occurrence of one data point does not influence any other, and identically distributed if every point is drawn from the same probability distribution. Together, these two properties mean that the probability of observing any particular value is the same for every point in the dataset.
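A minimal sketch of this definition, using Python's standard library (the normal distribution with mean 0 and standard deviation 1 is an illustrative choice):

```python
import random

# Draw 5 data points i.i.d. from the same normal distribution.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5)]

# Each call to rng.gauss() is independent of the others (no draw
# influences any other), and every call draws from the identical
# N(0, 1) distribution — so `samples` is an i.i.d. sample of size 5.
print(len(samples))
```

Each element of `samples` is generated by the same process, with no dependence between draws, which is exactly what the formal definition requires.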
In practice, i.i.d. data is often assumed to arise from random sampling of a fixed underlying distribution. In supervised learning, where models are trained to predict a target variable from input features, the i.i.d. assumption means that each (input, target) pair is an independent draw from the same joint distribution.
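The supervised-learning case can be sketched as follows. The particular joint distribution here (a linear relationship y = 2x plus small noise) is an illustrative assumption, not anything prescribed by the i.i.d. definition itself:

```python
import random

rng = random.Random(42)

def draw_pair():
    """Draw one (input, target) pair from a fixed joint distribution."""
    x = rng.uniform(-1.0, 1.0)           # input feature
    y = 2.0 * x + rng.gauss(0.0, 0.1)    # target with small noise
    return (x, y)

# Every pair comes from the same joint distribution, and no pair
# depends on any other — so the dataset is i.i.d.
dataset = [draw_pair() for _ in range(100)]
```

Because all 100 pairs share one generating process, a model trained on some of them can reasonably be expected to generalize to the rest.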
The i.i.d. assumption is essential in machine learning for several reasons: it makes the training data representative of the data the model will encounter after deployment, it justifies splitting a dataset into training and test sets, and it underlies the statistical results (such as the law of large numbers) that let us estimate a model's expected error from a finite sample.
Data collected by independent random sampling, such as height measurements of randomly selected people, can be considered i.i.d. Another example is a set of handwritten-digit images in which each image was created independently by a different individual writing the digit.
Imagine you have a bag filled with many candies. Each candy is unique in color, shape, and taste; when you reach into the bag and grab one candy, it is like drawing a single sample from a dataset.
When we say that candies are "independently and identically distributed," it implies two things: first, which candy you grab tells you nothing about which candy you will grab next (independence); second, every grab has the same chance of producing each kind of candy (identical distribution). In terms of data, "independently and identically distributed" refers to pieces of information that are unconnected to one another and drawn from the same kind of source.
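The candy analogy can be made concrete with a small simulation. Drawing with replacement keeps the draws i.i.d., because the bag's composition never changes; drawing without replacement breaks independence, because each draw changes what is left for the next. The bag's contents here are an illustrative assumption:

```python
import random

bag = ["red"] * 3 + ["green"] * 2 + ["blue"] * 5
rng = random.Random(1)

# i.i.d.: drawing WITH replacement — each draw sees the identical,
# full bag, and no draw affects any other.
iid_draws = [rng.choice(bag) for _ in range(4)]

# Not independent: drawing WITHOUT replacement — each draw removes
# a candy and so changes the distribution for the next draw.
remaining = bag.copy()
rng.shuffle(remaining)
dependent_draws = [remaining.pop() for _ in range(4)]
```

This is why textbook examples of i.i.d. sampling specify "with replacement": it is the replacement step that keeps the distribution identical from draw to draw.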