Sampling bias is a systematic error in statistics and machine learning that occurs when a sample is collected in such a way that some members of the intended population have a lower or higher probability of being included than others. The result is a non-representative sample, in which patterns observed in the data may be mistakenly attributed to the phenomenon under study when they actually reflect the flawed method of data collection. In statistical terms, sampling bias leads to systematic over- or under-estimation of population parameters, undermining the validity of any analysis built on the biased data.
Sampling bias is distinct from sampling error, which arises from random variation in sample selection. While sampling error relates to precision and decreases with larger sample sizes, sampling bias relates to accuracy and cannot be corrected simply by collecting more data. A biased sample of ten million observations remains biased; the 1936 Literary Digest presidential poll demonstrated this clearly when over two million responses still produced a wildly incorrect prediction.
In machine learning, sampling bias is one of the most common sources of poor model performance in production. A model trained on a biased dataset may achieve high accuracy on its test data while failing on real-world inputs, because both the training and test sets share the same systematic gaps in coverage.
Imagine you want to find out what everyone's favorite ice cream flavor is. But instead of asking kids at every school, you only ask kids at the school next to the chocolate ice cream factory. Most of those kids will probably say chocolate because they smell it every day and get free samples. If you then tell people that chocolate is the world's favorite flavor, you would be wrong, because you only asked a special group of kids who had a reason to like chocolate more.
Sampling bias works the same way. When you only look at information from certain kinds of people (or data points), you miss what everyone else thinks, and your answer ends up lopsided.
In probability theory, let a population consist of N individuals, and let a sampling mechanism assign each individual i a selection probability p_i. A sample is unbiased if p_i = 1/N for all i (simple random sampling) or, more generally, if every individual has a known, nonzero probability of selection (probability sampling). Sampling bias occurs when the actual selection probabilities deviate from the intended ones, meaning that for some individuals p_i is systematically too high or too low, or even zero.
If theta is the population parameter of interest (for example, the mean) and theta_hat is the estimator calculated from the sample, then the bias is defined as:
Bias(theta_hat) = E[theta_hat] - theta
When this bias is nonzero and arises from the way the sample was selected rather than from the estimator itself, it constitutes sampling bias. Because this bias is systematic, increasing the sample size n does not reduce it. Only changes to the sampling procedure or post-hoc corrections (such as reweighting) can address it.
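The persistence of sampling bias under growing n can be seen in a small simulation. The sketch below uses hypothetical values throughout: it draws skewed "income" values and compares simple random sampling with a selection mechanism whose inclusion probability is proportional to the value itself, mimicking a survey that over-reaches wealthier households.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 skewed "income" values.
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]
true_mean = sum(population) / len(population)

def biased_sample(pop, n):
    # Selection probability proportional to the value itself, mimicking
    # a survey that over-reaches wealthier households.
    return random.choices(pop, weights=pop, k=n)

def random_sample(pop, n):
    # Simple random sampling with replacement: every unit equally likely.
    return random.choices(pop, k=n)

for n in (100, 10_000):
    b = sum(biased_sample(population, n)) / n
    r = sum(random_sample(population, n)) / n
    print(f"n={n:>6}  biased/true = {b / true_mean:.2f}  random/true = {r / true_mean:.2f}")
```

The biased ratio stays well above 1 at every sample size (around 1.9 in expectation for this population): more data only tightens the estimate around the wrong value, while the random sample's ratio hovers near 1.0.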
Sampling bias is usually classified as a subtype of selection bias, though the two terms are often used interchangeably. A useful distinction is that sampling bias primarily threatens external validity (the ability to generalize results to the full population), while selection bias more broadly addresses internal validity (whether differences within the sample reflect genuine effects or artifacts of how participants were chosen). In practice, both concepts overlap considerably, and many sources treat them as synonyms.
Sampling bias manifests in many forms depending on the mechanism that distorts the sample. The table below summarizes the most widely recognized types.
| Type | Description | Example |
|---|---|---|
| Self-selection bias | Occurs when individuals volunteer to participate, and those who choose to participate differ systematically from those who do not | Online satisfaction surveys tend to attract people with strong opinions (very satisfied or very dissatisfied), while indifferent users rarely respond |
| Non-response bias | Arises when people who do not respond to a survey differ from those who do on variables of interest | In the U.S. Census Bureau's American Community Survey during 2020, low-earning households were much less likely to respond, biasing income estimates upward and poverty estimates downward |
| Survivorship bias | Results from focusing only on subjects that "survived" a selection process while ignoring those that did not | Analyzing only currently successful companies to draw business lessons ignores the many companies that tried the same strategies and failed |
| Undercoverage bias | Occurs when certain segments of the population are excluded or underrepresented in the sampling frame | A telephone survey conducted only via landlines misses people who only use mobile phones, typically younger and lower-income individuals |
| Overcoverage bias | Occurs when some population members appear multiple times in the sampling frame, inflating their selection probability | A mailing list with duplicate entries causes certain individuals to receive multiple survey invitations, making them more likely to be counted |
| Convenience bias | Results from selecting participants based on ease of access rather than randomization | A psychology study that recruits only college undergraduates may not generalize to the broader adult population |
| Reporting bias | Occurs when certain outcomes are more likely to be published or reported, skewing the available evidence | Medical journals historically published positive drug trial results more often than negative ones, creating a skewed picture of treatment effectiveness |
| Healthy user bias | The study population is systematically healthier than the general population | Studies on occupational health among manual laborers miss workers who left the occupation due to illness, overestimating the health of the remaining workers |
| Berkson's bias (admission rate bias) | A spurious association between diseases observed in hospital-based studies, because having either condition increases the probability of hospitalization | A 1946 study by Joseph Berkson showed that hospital patients without diabetes appeared more likely to have cholecystitis, simply because they needed some reason to be admitted |
| Temporal bias | Data collected during a specific time window does not represent the population across different time periods | Training a fraud detection model exclusively on holiday-season transaction data may cause poor performance during normal spending periods |
| Participation bias | The act of participating in a study changes the behavior or characteristics of participants | Patients enrolled in clinical trials may receive more attentive care than the general patient population, independent of the treatment being studied |
| Pre-screening bias | How a study is advertised or screened determines who sees and responds to it | An online ad for a health study on a fitness website attracts health-conscious respondents who are not representative of the general population |
Survivorship bias deserves special attention because of its pervasiveness and its often counterintuitive nature. The classic example comes from World War II. The U.S. military examined bullet damage on bombers returning from combat missions and initially proposed reinforcing the most heavily damaged areas (fuselage and wings). The mathematician Abraham Wald, working with the Statistical Research Group at Columbia University, recognized that this analysis suffered from survivorship bias. The planes being examined were the ones that survived; the bullet holes showed where a plane could take damage and still fly home. The areas with no damage on returning planes were likely the areas where hits proved fatal, because those planes never made it back. Wald recommended reinforcing the undamaged areas instead, and the military adopted his advice.
Survivorship bias appears frequently in everyday reasoning:
- Drawing business advice only from successful entrepreneurs, while the many who followed the same strategies and failed are never heard from
- Concluding that "old buildings were built to last" when only the sturdiest old buildings are still standing
- Inferring that dropping out of college leads to success from a handful of famous dropouts, ignoring the far larger number of dropouts who did not succeed
Self-selection bias (also called volunteer bias) is one of the most common forms of sampling bias in research and data collection. It occurs when individuals decide for themselves whether to participate in a study, and the decision to participate is correlated with the variables being studied.
Research has consistently found that volunteers tend to:
- be better educated and of higher socioeconomic status than non-volunteers
- be more sociable and more motivated to seek social approval
- hold stronger opinions about, or have a greater personal stake in, the topic being studied
This type of bias is especially problematic in online surveys and phone-in polls, where participation is entirely voluntary. As a result, these instruments tend to produce a "polarization of responses," with extreme perspectives receiving disproportionate weight while moderate views are underrepresented.
In machine learning, self-selection bias appears when user-generated data forms the training data. For example, product review datasets are dominated by users who feel strongly enough to write a review, while the silent majority of satisfied (but not enthusiastic) customers are absent from the data.
Berkson's bias (also known as Berkson's paradox or admission rate bias) is a form of sampling bias specific to studies conducted within hospitals or clinics. First described by the biostatistician Joseph Berkson in 1946, it occurs because a disease and an exposure can each independently increase the probability of hospital admission. When cases and controls are both drawn from a hospital population, this shared pathway to admission creates a spurious (usually negative) correlation between the disease and the exposure.
For example, suppose a researcher wants to study whether diabetes is associated with cholecystitis (gallbladder disease). If the study recruits both cases and controls from hospital patients, the controls (patients without diabetes) must have been admitted for some other reason, making them more likely to have cholecystitis. This creates an artificial association between diabetes and cholecystitis that does not exist in the general population.
The solution to Berkson's bias is straightforward in principle: use population-based sampling rather than hospital-based sampling. When every member of the population has an equal chance of being selected, the distortion introduced by differential admission rates disappears.
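A small simulation makes the paradox concrete. The rates below are hypothetical, and the two conditions are generated independently, yet conditioning on hospital admission manufactures a strong association among admitted patients.

```python
import random

random.seed(1)

# Hypothetical, independent population rates.
P_DIABETES, P_CHOLECYSTITIS, P_OTHER_ADMISSION = 0.05, 0.03, 0.02

population = []
for _ in range(200_000):
    d = random.random() < P_DIABETES
    c = random.random() < P_CHOLECYSTITIS
    # Either condition, or an unrelated cause, can lead to admission.
    admitted = d or c or random.random() < P_OTHER_ADMISSION
    population.append((d, c, admitted))

def cholecystitis_rate(people, diabetic):
    group = [c for d, c, _ in people if d == diabetic]
    return sum(group) / len(group)

hospital = [p for p in population if p[2]]

# In the full population the two rates match; among hospital patients,
# non-diabetics show far more cholecystitis, because they needed some
# other reason to be admitted.
print("population:", cholecystitis_rate(population, True), cholecystitis_rate(population, False))
print("hospital:  ", cholecystitis_rate(hospital, True), cholecystitis_rate(hospital, False))
```

Because every diabetic in this toy model is admitted regardless of gallbladder status, their cholecystitis rate in hospital matches the population rate, while non-diabetic admissions are heavily enriched for cholecystitis.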
Several well-known historical cases illustrate the consequences of sampling bias.
The Literary Digest magazine conducted one of the largest polls in history to predict the outcome of the 1936 U.S. presidential election between Franklin D. Roosevelt and Alf Landon. The magazine mailed over 10 million questionnaires and received approximately 2.4 million responses. Based on these responses, the Digest predicted that Landon would win decisively with 57% of the vote.
Roosevelt won in a landslide with about 61% of the popular vote, carrying 46 of 48 states.
The poll's failure stemmed from two forms of sampling bias. First, the mailing lists were drawn from telephone directories, automobile registration records, and country club memberships. During the Great Depression, these sources over-represented wealthy Americans, who were more likely to favor the Republican candidate. Second, of the 10 million people contacted, only 2.4 million responded (a 24% response rate), and Landon supporters were disproportionately motivated to return their ballots, introducing severe nonresponse bias.
Meanwhile, George Gallup's organization correctly predicted Roosevelt's victory using a carefully selected sample of just 50,000 citizens. This event established two lasting lessons: a large sample does not compensate for a biased sampling method, and how a sample is selected matters far more than how large it is.
In 1948, the Chicago Tribune published its famous "DEWEY DEFEATS TRUMAN" headline based on telephone polls predicting that Thomas Dewey would defeat Harry Truman. The polls suffered from coverage bias because telephones were not yet widespread, and people who owned them tended to be wealthier and more likely to vote Republican. Polling had also stopped roughly two weeks before the election, missing a late swing in voter sentiment.
During the early months of the COVID-19 pandemic, wide variations in testing policies across countries introduced substantial sampling bias into case counts. Countries that tested primarily hospitalized patients reported much higher case fatality rates than countries that conducted broader community testing. These differences were largely artifacts of sampling rather than genuine differences in disease severity. Researchers showed that variations in sampling bias accounted for much of the observed international variation in both case fatality rates and the apparent age distribution of cases.
Sampling bias is one of the most significant sources of bias in machine learning systems. Because ML models learn patterns from their training data, any systematic gaps or distortions in the data are directly reflected in the model's predictions.
When a dataset is not representative of the population the model will encounter in production, several problems arise:
- Accuracy degrades for underrepresented groups or input regions, even when aggregate metrics look strong
- Standard evaluation fails to reveal the problem, because the held-out test set inherits the same gaps as the training set
- Predictions can be systematically miscalibrated or unfair for the populations missing from the data
The table below lists real-world systems where such failures occurred.
| System or dataset | Type of sampling bias | Consequence |
|---|---|---|
| ImageNet | Geographic and demographic undercoverage | The dataset over-represented lighter-skinned individuals from Western countries, leading to lower accuracy on images of people from other regions. ImageNet later removed over 500,000 images from its "person" category after the biases were exposed. |
| Commercial facial recognition | Demographic undercoverage | A 2018 study by Joy Buolamwini and Timnit Gebru found that gender classification error rates for darker-skinned women were up to 34.7%, compared to 0.8% for lighter-skinned men, in commercial systems from major vendors. |
| COMPAS recidivism tool | Racial representation bias | ProPublica's 2016 analysis found that the COMPAS system's false positive rate (predicting recidivism when it did not occur) was significantly higher for Black defendants than for white defendants, raising questions about whether the training data reflected existing disparities in the criminal justice system. |
| Healthcare risk prediction | Historical utilization bias | A 2019 study published in Science found that a widely used algorithm for predicting healthcare needs assigned systematically lower risk scores to Black patients. The algorithm used healthcare spending as a proxy for health needs, but because Black patients historically had less access to healthcare, their lower spending did not reflect lower medical need. |
| Cardiac MRI segmentation | Demographic undercoverage | A deep learning model trained on data that was 80% White achieved a Dice Similarity Coefficient of 93.5% for White subjects but only 84.5% for Black and Mixed-race subjects. |
Temporal bias is a form of sampling bias where the training data reflects conditions from a specific time period that may not hold in the future. This is closely related to the concept of concept drift, where the statistical relationship between input features and output labels changes over time.
Examples include:
- Recommendation models trained on pre-pandemic behavior that failed when consumer habits shifted abruptly in 2020
- Financial models fit during a bull market that break down under different economic conditions
- Language models whose training cutoff leaves them unaware of newer vocabulary, entities, and events
Dealing with temporal bias typically requires continuous monitoring of model performance and periodic retraining on updated data. Many production ML systems implement automated data drift detection to alert engineers when the distribution of incoming data diverges significantly from the training distribution.
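One common drift check is a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature against each incoming production batch. The sketch below uses synthetic transaction amounts and an arbitrary significance threshold; real systems typically monitor many features at once and correct for multiple testing.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature: transaction amounts seen at training time...
train_amounts = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)

# ...and a production batch from a shifted regime, e.g. a holiday
# season with larger purchases.
holiday_batch = rng.lognormal(mean=3.6, sigma=0.5, size=1_000)

def drifted(reference, batch, alpha=0.01):
    # Two-sample Kolmogorov-Smirnov test: a small p-value means the
    # batch's distribution differs markedly from the reference.
    return ks_2samp(reference, batch).pvalue < alpha

print(drifted(train_amounts, train_amounts))  # False: identical data
print(drifted(train_amounts, holiday_batch))  # True: distribution shifted
```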
Identifying sampling bias is the first step toward addressing it. Several approaches are used in both traditional statistics and machine learning.
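When population margins are known (for example, age-group shares from a census), a chi-square goodness-of-fit test offers a simple first check of whether a sample's composition deviates from them. All figures below are hypothetical.

```python
from scipy.stats import chisquare

# Known population margins (hypothetical census age shares).
population_shares = {"18-29": 0.21, "30-49": 0.34, "50-64": 0.25, "65+": 0.20}

# Observed respondent counts in a survey of 1,000 people.
sample_counts = {"18-29": 120, "30-49": 310, "50-64": 290, "65+": 280}

n = sum(sample_counts.values())
observed = [sample_counts[g] for g in population_shares]
expected = [population_shares[g] * n for g in population_shares]

# Goodness-of-fit test: a small p-value flags a sample whose age
# composition deviates significantly from the population's.
stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p:.2g}")
```

A significant result indicates a compositional mismatch, but not its cause; follow-up is needed to distinguish undercoverage from non-response.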
Mitigating sampling bias requires intervention at different stages of the data collection and modeling pipeline.
| Technique | Description | When to use |
|---|---|---|
| Simple random sampling | Every member of the population has an equal probability of being selected | When a complete list of the population (sampling frame) is available |
| Stratified sampling | The population is divided into subgroups (strata) based on key characteristics, and random samples are drawn from each stratum in proportion to its population share | When specific subgroups must be adequately represented, such as minority demographic groups |
| Cluster sampling | The population is divided into clusters (often geographic), a random selection of clusters is chosen, and all or some members within those clusters are sampled | When a complete population list is unavailable but clusters can be identified |
| Systematic sampling | Every kth member of an ordered population list is selected after a random starting point | When the population can be listed but simple random selection is impractical |
| Oversampling minority groups | Intentionally sampling underrepresented groups at higher rates, then applying weights to produce population-representative estimates | When certain groups are rare in the population but must be well-represented for analysis |
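As a concrete illustration of proportionate stratified sampling from the table above, the sketch below (using a made-up two-stratum frame) draws a simple random sample from each stratum in proportion to its population share, so the sample's composition matches the frame's by construction.

```python
import random

random.seed(7)

# Hypothetical sampling frame: 10,000 people with a known stratum label.
frame = ([("urban", i) for i in range(8_000)]
         + [("rural", i) for i in range(2_000)])

def stratified_sample(frame, n):
    # Group the frame by stratum, then draw a simple random sample from
    # each stratum in proportion to its share of the population.
    strata = {}
    for label, unit in frame:
        strata.setdefault(label, []).append((label, unit))
    sample = []
    for label, units in strata.items():
        k = round(n * len(units) / len(frame))
        sample.extend(random.sample(units, k))
    return sample

s = stratified_sample(frame, 500)
counts = {label: sum(1 for l, _ in s if l == label) for label in ("urban", "rural")}
print(counts)  # {'urban': 400, 'rural': 100}: shares match the frame exactly
```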
| Technique | Description | Considerations |
|---|---|---|
| Sample reweighting | Assigning weights to each observation so that the weighted sample matches the population distribution. Each sample receives a weight equal to its population proportion divided by its sampling proportion. | Requires knowledge of the true population distribution; can increase variance if weights are extreme |
| Oversampling (random) | Duplicating existing minority class observations to balance class proportions | Risks overfitting because the model sees identical copies of minority examples |
| Undersampling (random) | Removing majority class observations to balance class proportions | Loses potentially useful information from the majority class |
| SMOTE (Synthetic Minority Over-sampling Technique) | Generates synthetic minority class examples by interpolating between existing minority class neighbors rather than duplicating them | Reduces overfitting risk compared to random oversampling; should only be applied to training data, never to validation or test sets |
| Inverse probability weighting (IPW) | Weights each observation by the inverse of its estimated probability of being included in the sample | Widely used in causal inference; requires a correctly specified selection model |
| Propensity score matching | Matches treated and untreated observations with similar estimated probabilities of treatment, creating a pseudo-randomized comparison | Useful in observational studies where randomization is not possible |
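The sample-reweighting formula in the table above (weight = population proportion / sampling proportion) can be verified with a toy example. All numbers here are hypothetical: the population is an even 50/50 split of two groups, but the sample over-represents group A four to one.

```python
# Hypothetical: the population is an even 50/50 split of groups A and B,
# but the sample over-represents A four to one.
population_share = {"A": 0.5, "B": 0.5}
sample = [("A", 10.0)] * 80 + [("B", 20.0)] * 20   # (group, outcome) pairs

n = len(sample)
sample_share = {g: sum(1 for grp, _ in sample if grp == g) / n
                for g in population_share}

# Weight = population proportion / sampling proportion.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

naive_mean = sum(y for _, y in sample) / n
weighted_mean = (sum(weights[g] * y for g, y in sample)
                 / sum(weights[g] for g, _ in sample))

print(round(naive_mean, 2))     # 12.0: dragged toward the over-sampled group
print(round(weighted_mean, 2))  # 15.0: the true population mean
```

Note the table's caveat in action: the reweighted estimate is correct here only because the true population shares were known.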
The economist James Heckman developed a two-step statistical method for correcting sample selection bias, work for which he received the Nobel Memorial Prize in Economic Sciences in 2000. The Heckman correction is widely used in econometrics and the social sciences.
The method works in two stages:
1. Estimate a probit model of the probability that an observation is included in the sample (the selection equation), and from it compute each selected observation's inverse Mills ratio.
2. Estimate the outcome equation on the selected sample, including the inverse Mills ratio as an additional regressor to absorb the selection effect.
By explicitly modeling the probability of inclusion, the Heckman correction can recover unbiased estimates even from non-randomly selected samples, provided the selection model is correctly specified. The method assumes that the errors in the selection and outcome equations follow a bivariate normal distribution.
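A minimal sketch of the two-step procedure on simulated data, assuming the bivariate-normal error structure described above (true slope 2.0, error correlation 0.8; all parameter values are made up). A probit selection equation is fit by maximum likelihood, the inverse Mills ratio is computed for the selected observations, and the outcome equation is then re-estimated with the ratio as an extra regressor.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20_000

x = rng.normal(size=n)
z = rng.normal(size=n)   # drives selection but not the outcome

# Correlated errors tie selection to the outcome (rho = 0.8).
u, v = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n).T

y_star = 1.0 + 2.0 * x + u                 # outcome equation, true slope 2.0
s = (0.5 + 1.0 * x + 1.5 * z + v) > 0      # selection: observed iff True

# Stage 1: probit model of selection, fit by maximum likelihood.
W = np.column_stack([np.ones(n), x, z])
q = 2 * s - 1

def neg_loglik(g):
    return -norm.logcdf(q * (W @ g)).sum()

gamma = minimize(neg_loglik, np.zeros(3), method="BFGS").x

# Inverse Mills ratio for the selected observations.
index = W[s] @ gamma
imr = norm.pdf(index) / norm.cdf(index)

# Stage 2: OLS of the outcome on x plus the inverse Mills ratio.
X2 = np.column_stack([np.ones(s.sum()), x[s], imr])
beta = np.linalg.lstsq(X2, y_star[s], rcond=None)[0]

# Naive OLS on the selected sample only, for comparison.
Xn = np.column_stack([np.ones(s.sum()), x[s]])
beta_naive = np.linalg.lstsq(Xn, y_star[s], rcond=None)[0]

print("naive slope:    ", round(beta_naive[1], 2))   # biased by the selection
print("corrected slope:", round(beta[1], 2))         # close to the true 2.0
```

The coefficient on the inverse Mills ratio estimates rho times the outcome-error standard deviation (0.8 here); the exclusion of z from the outcome equation is what keeps the second stage well identified.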
Sampling bias is one of several types of bias that can affect research and machine learning. The table below summarizes how it relates to other common biases.
| Type of bias | What it affects | How it differs from sampling bias |
|---|---|---|
| Selection bias | Internal and external validity | Broader category that includes sampling bias; also covers biases arising from how participants are assigned to groups within a study |
| Confirmation bias | Interpretation of results | A cognitive bias where researchers favor evidence that supports their preexisting beliefs; affects analysis rather than data collection |
| Measurement bias | Data accuracy | Arises from faulty instruments or inconsistent measurement procedures rather than from how subjects are selected |
| Reporting bias | Published evidence | Occurs when certain results (usually positive ones) are more likely to be published, regardless of how the sample was collected |
| Implicit bias | Data labeling and feature selection | Unconscious preferences that influence which features are collected and how data labeling is performed |
| Prediction bias | Model output calibration | The difference between the average prediction and the average observation in a dataset; may result from sampling bias but can also arise from model architecture |
| Coverage bias | Sampling frame completeness | A subtype of sampling bias that occurs when the sampling frame does not match the target population |