Sampling bias is a systematic error in statistics and machine learning that occurs when a sample is collected in such a way that some members of the intended population have a lower or higher probability of being included than others. The result is a non-representative sample, in which patterns observed in the data may be mistakenly attributed to the phenomenon under study when they actually reflect the flawed method of data collection. In statistical terms, sampling bias leads to systematic over- or under-estimation of population parameters, undermining the validity of any analysis built on the biased data.
Sampling bias is distinct from sampling error, which arises from random variation in sample selection. While sampling error relates to precision and decreases with larger sample sizes, sampling bias relates to accuracy and cannot be corrected simply by collecting more data. A biased sample of ten million observations remains biased; the 1936 Literary Digest presidential poll demonstrated this clearly when over two million responses still produced a wildly incorrect prediction.
In machine learning, sampling bias is one of the most common sources of poor model performance in production. A model trained on a biased dataset may achieve high accuracy on its test data while failing on real-world inputs, because both the training and test sets share the same systematic gaps in coverage.
Imagine you want to find out what everyone's favorite ice cream flavor is. But instead of asking kids at every school, you only ask kids at the school next to the chocolate ice cream factory. Most of those kids will probably say chocolate because they smell it every day and get free samples. If you then tell people that chocolate is the world's favorite flavor, you would be wrong, because you only asked a special group of kids who had a reason to like chocolate more.
Sampling bias works the same way. When you only look at information from certain kinds of people (or data points), you miss what everyone else thinks, and your answer ends up lopsided.
In probability theory, let a population consist of N individuals, and let a sampling mechanism assign each individual i a selection probability p_i. A sample is unbiased if p_i = 1/N for all i (simple random sampling) or, more generally, if every individual has a known, nonzero probability of selection (probability sampling). Sampling bias occurs when the actual selection probabilities deviate from the intended ones, meaning that for some individuals p_i is systematically too high or too low, or even zero.
If theta is the population parameter of interest (for example, the mean) and theta_hat is the estimator calculated from the sample, then the bias is defined as:
Bias(theta_hat) = E[theta_hat] - theta
When this bias is nonzero and arises from the way the sample was selected rather than from the estimator itself, it constitutes sampling bias. Because this bias is systematic, increasing the sample size n does not reduce it. Only changes to the sampling procedure or post-hoc corrections (such as reweighting) can address it.
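The persistence of sampling bias under growing n can be seen in a small simulation. The sketch below uses hypothetical values throughout: it draws skewed "income" values and compares simple random sampling with a selection mechanism whose inclusion probability is proportional to the value itself, mimicking a survey that over-reaches wealthier households.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 skewed "income" values.
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]
true_mean = sum(population) / len(population)

def biased_sample(pop, n):
    # Selection probability proportional to the value itself, mimicking
    # a survey that over-reaches wealthier households.
    return random.choices(pop, weights=pop, k=n)

def random_sample(pop, n):
    # Simple random sampling with replacement: every unit equally likely.
    return random.choices(pop, k=n)

for n in (100, 10_000):
    b = sum(biased_sample(population, n)) / n
    r = sum(random_sample(population, n)) / n
    print(f"n={n:>6}  biased/true = {b / true_mean:.2f}  random/true = {r / true_mean:.2f}")
```

The biased ratio stays well above 1 at every sample size (around 1.9 in expectation for this population): more data only tightens the estimate around the wrong value, while the random sample's ratio hovers near 1.0.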
Sampling bias is usually classified as a subtype of selection bias, though the two terms are often used interchangeably. A useful distinction is that sampling bias primarily threatens external validity (the ability to generalize results to the full population), while selection bias more broadly addresses internal validity (whether differences within the sample reflect genuine effects or artifacts of how participants were chosen). In practice, both concepts overlap considerably, and many sources treat them as synonyms.
Sampling bias manifests in many forms depending on the mechanism that distorts the sample. The table below summarizes the most widely recognized types.
| Type | Description | Example |
|---|---|---|
| Self-selection bias | Occurs when individuals volunteer to participate, and those who choose to participate differ systematically from those who do not | Online satisfaction surveys tend to attract people with strong opinions (very satisfied or very dissatisfied), while indifferent users rarely respond |
| Non-response bias | Arises when people who do not respond to a survey differ from those who do on variables of interest | In the U.S. Census Bureau's American Community Survey during 2020, low-earning households were much less likely to respond, biasing income estimates upward and poverty estimates downward |
| Survivorship bias | Results from focusing only on subjects that "survived" a selection process while ignoring those that did not | Analyzing only currently successful companies to draw business lessons ignores the many companies that tried the same strategies and failed |
| Undercoverage bias | Occurs when certain segments of the population are excluded or underrepresented in the sampling frame | A telephone survey conducted only via landlines misses people who only use mobile phones, typically younger and lower-income individuals |
| Overcoverage bias | Occurs when some population members appear multiple times in the sampling frame, inflating their selection probability | A mailing list with duplicate entries causes certain individuals to receive multiple survey invitations, making them more likely to be counted |
| Convenience bias | Results from selecting participants based on ease of access rather than randomization | A psychology study that recruits only college undergraduates may not generalize to the broader adult population |
| Reporting bias | Occurs when certain outcomes are more likely to be published or reported, skewing the available evidence | Medical journals historically published positive drug trial results more often than negative ones, creating a skewed picture of treatment effectiveness |
| Healthy user bias | The study population is systematically healthier than the general population | Studies on occupational health among manual laborers miss workers who left the occupation due to illness, overestimating the health of the remaining workers |
| Berkson's bias (admission rate bias) | A spurious association between diseases observed in hospital-based studies, because having either condition increases the probability of hospitalization | A 1946 study by Joseph Berkson showed that hospital patients without diabetes appeared more likely to have cholecystitis, simply because they needed some reason to be admitted |
| Temporal bias | Data collected during a specific time window does not represent the population across different time periods | Training a fraud detection model exclusively on holiday-season transaction data may cause poor performance during normal spending periods |
| Participation bias | The act of participating in a study changes the behavior or characteristics of participants | Patients enrolled in clinical trials may receive more attentive care than the general patient population, independent of the treatment being studied |
| Pre-screening bias | How a study is advertised or screened determines who sees and responds to it | An online ad for a health study on a fitness website attracts health-conscious respondents who are not representative of the general population |
Survivorship bias deserves special attention because of its pervasiveness and its often counterintuitive nature. The classic example comes from World War II. The U.S. military examined bullet damage on bombers returning from combat missions and initially proposed reinforcing the most heavily damaged areas (fuselage and wings). The mathematician Abraham Wald, working with the Statistical Research Group at Columbia University, recognized that this analysis suffered from survivorship bias. The planes being examined were the ones that survived; the bullet holes showed where a plane could take damage and still fly home. The areas with no damage on returning planes were likely the areas where hits proved fatal, because those planes never made it back. Wald recommended reinforcing the undamaged areas instead, and the military adopted his advice.
Survivorship bias appears frequently in everyday reasoning:
- Drawing business advice only from successful entrepreneurs, while the many who followed the same strategies and failed are never heard from
- Concluding that "old buildings were built to last" when only the sturdiest old buildings are still standing
- Inferring that dropping out of college leads to success from a handful of famous dropouts, ignoring the far larger number of dropouts who did not succeed
Self-selection bias (also called volunteer bias) is one of the most common forms of sampling bias in research and data collection. It occurs when individuals decide for themselves whether to participate in a study, and the decision to participate is correlated with the variables being studied.
Research has consistently found that volunteers tend to:
- be better educated and of higher socioeconomic status than non-volunteers
- be more sociable and more motivated to seek social approval
- hold stronger opinions about, or have a greater personal stake in, the topic being studied
This type of bias is especially problematic in online surveys and phone-in polls, where participation is entirely voluntary. As a result, these instruments tend to produce a "polarization of responses," with extreme perspectives receiving disproportionate weight while moderate views are underrepresented.
In machine learning, self-selection bias appears when user-generated data forms the training data. For example, product review datasets are dominated by users who feel strongly enough to write a review, while the silent majority of satisfied (but not enthusiastic) customers are absent from the data.
Berkson's bias (also known as Berkson's paradox or admission rate bias) is a form of sampling bias specific to studies conducted within hospitals or clinics. First described by the biostatistician Joseph Berkson in 1946, it occurs because a disease and an exposure can each independently increase the probability of hospital admission. When cases and controls are both drawn from a hospital population, this shared pathway to admission creates a spurious (usually negative) correlation between the disease and the exposure.
For example, suppose a researcher wants to study whether diabetes is associated with cholecystitis (gallbladder disease). If the study recruits both cases and controls from hospital patients, the controls (patients without diabetes) must have been admitted for some other reason, making them more likely to have cholecystitis. This creates an artificial association between diabetes and cholecystitis that does not exist in the general population.
The solution to Berkson's bias is straightforward in principle: use population-based sampling rather than hospital-based sampling. When every member of the population has an equal chance of being selected, the distortion introduced by differential admission rates disappears.
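A small simulation makes the paradox concrete. The rates below are hypothetical, and the two conditions are generated independently, yet conditioning on hospital admission manufactures a strong association among admitted patients.

```python
import random

random.seed(1)

# Hypothetical, independent population rates.
P_DIABETES, P_CHOLECYSTITIS, P_OTHER_ADMISSION = 0.05, 0.03, 0.02

population = []
for _ in range(200_000):
    d = random.random() < P_DIABETES
    c = random.random() < P_CHOLECYSTITIS
    # Either condition, or an unrelated cause, can lead to admission.
    admitted = d or c or random.random() < P_OTHER_ADMISSION
    population.append((d, c, admitted))

def cholecystitis_rate(people, diabetic):
    group = [c for d, c, _ in people if d == diabetic]
    return sum(group) / len(group)

hospital = [p for p in population if p[2]]

# In the full population the two rates match; among hospital patients,
# non-diabetics show far more cholecystitis, because they needed some
# other reason to be admitted.
print("population:", cholecystitis_rate(population, True), cholecystitis_rate(population, False))
print("hospital:  ", cholecystitis_rate(hospital, True), cholecystitis_rate(hospital, False))
```

Because every diabetic in this toy model is admitted regardless of gallbladder status, their cholecystitis rate in hospital matches the population rate, while non-diabetic admissions are heavily enriched for cholecystitis.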
Several well-known historical cases illustrate the consequences of sampling bias.
The Literary Digest magazine conducted one of the largest polls in history to predict the outcome of the 1936 U.S. presidential election between Franklin D. Roosevelt and Alf Landon. The magazine mailed over 10 million questionnaires and received approximately 2.4 million responses. Based on these responses, the Digest predicted that Landon would win decisively with 57% of the vote.
Roosevelt won in a landslide with about 61% of the popular vote, carrying 46 of 48 states.
The poll's failure stemmed from two forms of sampling bias. First, the mailing lists were drawn from telephone directories, automobile registration records, and country club memberships. During the Great Depression, these sources over-represented wealthy Americans, who were more likely to favor the Republican candidate. Second, of the 10 million people contacted, only 2.4 million responded (a 24% response rate), and Landon supporters were disproportionately motivated to return their ballots, introducing severe nonresponse bias.
Meanwhile, George Gallup's organization correctly predicted Roosevelt's victory using a carefully selected sample of just 50,000 citizens. This event established two lasting lessons: a large sample does not compensate for a biased sampling method, and how a sample is selected matters far more than how large it is.
In 1948, the Chicago Tribune published its famous "DEWEY DEFEATS TRUMAN" headline based on telephone polls predicting that Thomas Dewey would defeat Harry Truman. The polls suffered from coverage bias because telephones were not yet widespread, and people who owned them tended to be wealthier and more likely to vote Republican. Polling had also stopped roughly two weeks before the election, missing a late swing in voter sentiment.
During the early months of the COVID-19 pandemic, wide variations in testing policies across countries introduced substantial sampling bias into case counts. Countries that tested primarily hospitalized patients reported much higher case fatality rates than countries that conducted broader community testing. These differences were largely artifacts of sampling rather than genuine differences in disease severity. Researchers showed that variations in sampling bias accounted for much of the observed international variation in both case fatality rates and the apparent age distribution of cases.
Sampling bias is one of the most significant sources of bias in machine learning systems. Because ML models learn patterns from their training data, any systematic gaps or distortions in the data are directly reflected in the model's predictions.
When a dataset is not representative of the population the model will encounter in production, several problems arise:
- Accuracy degrades for underrepresented groups or input regions, even when aggregate metrics look strong
- Standard evaluation fails to reveal the problem, because the held-out test set inherits the same gaps as the training set
- Predictions can be systematically miscalibrated or unfair for the populations missing from the data
The table below lists real-world systems where such failures occurred.
| System or dataset | Type of sampling bias | Consequence |
|---|---|---|
| ImageNet | Geographic and demographic undercoverage | The dataset over-represented lighter-skinned individuals from Western countries, leading to lower accuracy on images of people from other regions. ImageNet later removed over 500,000 images from its "person" category after the biases were exposed. |
| Commercial facial recognition | Demographic undercoverage | A 2018 study by Joy Buolamwini and Timnit Gebru found that gender classification error rates for darker-skinned women were up to 34.7%, compared to 0.8% for lighter-skinned men, in commercial systems from major vendors. |
| COMPAS recidivism tool | Racial representation bias | ProPublica's 2016 analysis found that the COMPAS system's false positive rate (predicting recidivism when it did not occur) was significantly higher for Black defendants than for white defendants, raising questions about whether the training data reflected existing disparities in the criminal justice system. |
| Healthcare risk prediction | Historical utilization bias | A 2019 study published in Science found that a widely used algorithm for predicting healthcare needs assigned systematically lower risk scores to Black patients. The algorithm used healthcare spending as a proxy for health needs, but because Black patients historically had less access to healthcare, their lower spending did not reflect lower medical need. |
| Cardiac MRI segmentation | Demographic undercoverage | A deep learning model trained on data that was 80% White achieved a Dice Similarity Coefficient of 93.5% for White subjects but only 84.5% for Black and Mixed-race subjects. |
Temporal bias is a form of sampling bias where the training data reflects conditions from a specific time period that may not hold in the future. This is closely related to the concept of concept drift, where the statistical relationship between input features and output labels changes over time.
Examples include:
- Recommendation models trained on pre-pandemic behavior that failed when consumer habits shifted abruptly in 2020
- Financial models fit during a bull market that break down under different economic conditions
- Language models whose training cutoff leaves them unaware of newer vocabulary, entities, and events
Dealing with temporal bias typically requires continuous monitoring of model performance and periodic retraining on updated data. Many production ML systems implement automated data drift detection to alert engineers when the distribution of incoming data diverges significantly from the training distribution.
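One common drift check is a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature against each incoming production batch. The sketch below uses synthetic transaction amounts and an arbitrary significance threshold; real systems typically monitor many features at once and correct for multiple testing.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature: transaction amounts seen at training time...
train_amounts = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)

# ...and a production batch from a shifted regime, e.g. a holiday
# season with larger purchases.
holiday_batch = rng.lognormal(mean=3.6, sigma=0.5, size=1_000)

def drifted(reference, batch, alpha=0.01):
    # Two-sample Kolmogorov-Smirnov test: a small p-value means the
    # batch's distribution differs markedly from the reference.
    return ks_2samp(reference, batch).pvalue < alpha

print(drifted(train_amounts, train_amounts))  # False: identical data
print(drifted(train_amounts, holiday_batch))  # True: distribution shifted
```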
Identifying sampling bias is the first step toward addressing it. Several approaches are used in both traditional statistics and machine learning.
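When population margins are known (for example, age-group shares from a census), a chi-square goodness-of-fit test offers a simple first check of whether a sample's composition deviates from them. All figures below are hypothetical.

```python
from scipy.stats import chisquare

# Known population margins (hypothetical census age shares).
population_shares = {"18-29": 0.21, "30-49": 0.34, "50-64": 0.25, "65+": 0.20}

# Observed respondent counts in a survey of 1,000 people.
sample_counts = {"18-29": 120, "30-49": 310, "50-64": 290, "65+": 280}

n = sum(sample_counts.values())
observed = [sample_counts[g] for g in population_shares]
expected = [population_shares[g] * n for g in population_shares]

# Goodness-of-fit test: a small p-value flags a sample whose age
# composition deviates significantly from the population's.
stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p:.2g}")
```

A significant result indicates a compositional mismatch, but not its cause; follow-up is needed to distinguish undercoverage from non-response.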
Mitigating sampling bias requires intervention at different stages of the data collection and modeling pipeline.
| Technique | Description | When to use |
|---|---|---|
| Simple random sampling | Every member of the population has an equal probability of being selected | When a complete list of the population (sampling frame) is available |
| Stratified sampling | The population is divided into subgroups (strata) based on key characteristics, and random samples are drawn from each stratum in proportion to its population share | When specific subgroups must be adequately represented, such as minority demographic groups |
| Cluster sampling | The population is divided into clusters (often geographic), a random selection of clusters is chosen, and all or some members within those clusters are sampled | When a complete population list is unavailable but clusters can be identified |
| Systematic sampling | Every kth member of an ordered population list is selected after a random starting point | When the population can be listed but simple random selection is impractical |
| Oversampling minority groups | Intentionally sampling underrepresented groups at higher rates, then applying weights to produce population-representative estimates | When certain groups are rare in the population but must be well-represented for analysis |
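As a concrete illustration of proportionate stratified sampling from the table above, the sketch below (using a made-up two-stratum frame) draws a simple random sample from each stratum in proportion to its population share, so the sample's composition matches the frame's by construction.

```python
import random

random.seed(7)

# Hypothetical sampling frame: 10,000 people with a known stratum label.
frame = ([("urban", i) for i in range(8_000)]
         + [("rural", i) for i in range(2_000)])

def stratified_sample(frame, n):
    # Group the frame by stratum, then draw a simple random sample from
    # each stratum in proportion to its share of the population.
    strata = {}
    for label, unit in frame:
        strata.setdefault(label, []).append((label, unit))
    sample = []
    for label, units in strata.items():
        k = round(n * len(units) / len(frame))
        sample.extend(random.sample(units, k))
    return sample

s = stratified_sample(frame, 500)
counts = {label: sum(1 for l, _ in s if l == label) for label in ("urban", "rural")}
print(counts)  # {'urban': 400, 'rural': 100}: shares match the frame exactly
```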
| Technique | Description | Considerations |
|---|---|---|
| Sample reweighting | Assigning weights to each observation so that the weighted sample matches the population distribution. Each sample receives a weight equal to its population proportion divided by its sampling proportion. | Requires knowledge of the true population distribution; can increase variance if weights are extreme |
| Oversampling (random) | Duplicating existing minority class observations to balance class proportions | Risks overfitting because the model sees identical copies of minority examples |
| Undersampling (random) | Removing majority class observations to balance class proportions | Loses potentially useful information from the majority class |
| SMOTE (Synthetic Minority Over-sampling Technique) | Generates synthetic minority class examples by interpolating between existing minority class neighbors rather than duplicating them | Reduces overfitting risk compared to random oversampling; should only be applied to training data, never to validation or test sets |
| Inverse probability weighting (IPW) | Weights each observation by the inverse of its estimated probability of being included in the sample | Widely used in causal inference; requires a correctly specified selection model |
| Propensity score matching | Matches treated and untreated observations with similar estimated probabilities of treatment, creating a pseudo-randomized comparison | Useful in observational studies where randomization is not possible |
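The sample-reweighting formula in the table above (weight = population proportion / sampling proportion) can be verified with a toy example. All numbers here are hypothetical: the population is an even 50/50 split of two groups, but the sample over-represents group A four to one.

```python
# Hypothetical: the population is an even 50/50 split of groups A and B,
# but the sample over-represents A four to one.
population_share = {"A": 0.5, "B": 0.5}
sample = [("A", 10.0)] * 80 + [("B", 20.0)] * 20   # (group, outcome) pairs

n = len(sample)
sample_share = {g: sum(1 for grp, _ in sample if grp == g) / n
                for g in population_share}

# Weight = population proportion / sampling proportion.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

naive_mean = sum(y for _, y in sample) / n
weighted_mean = (sum(weights[g] * y for g, y in sample)
                 / sum(weights[g] for g, _ in sample))

print(round(naive_mean, 2))     # 12.0: dragged toward the over-sampled group
print(round(weighted_mean, 2))  # 15.0: the true population mean
```

Note the table's caveat in action: the reweighted estimate is correct here only because the true population shares were known.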
The economist James Heckman developed a two-step statistical method for correcting sample selection bias, work for which he received the Nobel Memorial Prize in Economic Sciences in 2000. The Heckman correction is widely used in econometrics and the social sciences.
The method works in two stages:
1. Estimate a probit model of the probability that an observation is included in the sample (the selection equation), and from it compute each selected observation's inverse Mills ratio.
2. Estimate the outcome equation on the selected sample, including the inverse Mills ratio as an additional regressor to absorb the selection effect.
By explicitly modeling the probability of inclusion, the Heckman correction can recover unbiased estimates even from non-randomly selected samples, provided the selection model is correctly specified. The method assumes that the errors in the selection and outcome equations follow a bivariate normal distribution.
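A minimal sketch of the two-step procedure on simulated data, assuming the bivariate-normal error structure described above (true slope 2.0, error correlation 0.8; all parameter values are made up). A probit selection equation is fit by maximum likelihood, the inverse Mills ratio is computed for the selected observations, and the outcome equation is then re-estimated with the ratio as an extra regressor.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20_000

x = rng.normal(size=n)
z = rng.normal(size=n)   # drives selection but not the outcome

# Correlated errors tie selection to the outcome (rho = 0.8).
u, v = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n).T

y_star = 1.0 + 2.0 * x + u                 # outcome equation, true slope 2.0
s = (0.5 + 1.0 * x + 1.5 * z + v) > 0      # selection: observed iff True

# Stage 1: probit model of selection, fit by maximum likelihood.
W = np.column_stack([np.ones(n), x, z])
q = 2 * s - 1

def neg_loglik(g):
    return -norm.logcdf(q * (W @ g)).sum()

gamma = minimize(neg_loglik, np.zeros(3), method="BFGS").x

# Inverse Mills ratio for the selected observations.
index = W[s] @ gamma
imr = norm.pdf(index) / norm.cdf(index)

# Stage 2: OLS of the outcome on x plus the inverse Mills ratio.
X2 = np.column_stack([np.ones(s.sum()), x[s], imr])
beta = np.linalg.lstsq(X2, y_star[s], rcond=None)[0]

# Naive OLS on the selected sample only, for comparison.
Xn = np.column_stack([np.ones(s.sum()), x[s]])
beta_naive = np.linalg.lstsq(Xn, y_star[s], rcond=None)[0]

print("naive slope:    ", round(beta_naive[1], 2))   # biased by the selection
print("corrected slope:", round(beta[1], 2))         # close to the true 2.0
```

The coefficient on the inverse Mills ratio estimates rho times the outcome-error standard deviation (0.8 here); the exclusion of z from the outcome equation is what keeps the second stage well identified.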
Sampling bias is one of several types of bias that can affect research and machine learning. The table below summarizes how it relates to other common biases.
| Type of bias | What it affects | How it differs from sampling bias |
|---|---|---|
| Selection bias | Internal and external validity | Broader category that includes sampling bias; also covers biases arising from how participants are assigned to groups within a study |
| Confirmation bias | Interpretation of results | A cognitive bias where researchers favor evidence that supports their preexisting beliefs; affects analysis rather than data collection |
| Measurement bias | Data accuracy | Arises from faulty instruments or inconsistent measurement procedures rather than from how subjects are selected |
| Reporting bias | Published evidence | Occurs when certain results (usually positive ones) are more likely to be published, regardless of how the sample was collected |
| Implicit bias | Data labeling and feature selection | Unconscious preferences that influence which features are collected and how data labeling is performed |
| Prediction bias | Model output calibration | The difference between the average prediction and the average observation in a dataset; may result from sampling bias but can also arise from model architecture |
| Coverage bias | Sampling frame completeness | A subtype of sampling bias that occurs when the sampling frame does not match the target population |