Selection bias is a systematic error that occurs when the data used for analysis, training, or evaluation does not accurately represent the population or domain it is intended to describe. In machine learning, selection bias arises when the process of collecting, filtering, or curating training data produces a sample that differs systematically from the target distribution. Models trained on biased samples tend to learn distorted patterns, resulting in poor generalization, unfair predictions, and unreliable performance when deployed in real-world settings.
The concept has deep roots in statistics and epidemiology. Joseph Berkson first described a form of selection bias in hospital-based studies in 1946, showing how conditioning on hospital admission could create spurious associations between diseases. Decades later, economist James Heckman developed formal methods for correcting sample selection bias in econometric models, work that earned him the Nobel Memorial Prize in Economic Sciences in 2000. As machine learning systems have become more widespread, selection bias has emerged as one of the most common and consequential sources of error in data-driven decision-making.
Imagine you want to find out what flavor of ice cream kids like best. But you only ask the kids at a birthday party where chocolate ice cream is being served. Most of them will say "chocolate!" because that is what they are eating right now. You would think everyone loves chocolate the most, but you missed all the kids at home who might prefer vanilla or strawberry. Selection bias is like that: when you only look at part of the picture, you get the wrong answer because your sample is not a fair representation of everybody.
In statistical terms, selection bias occurs when the probability of an observation being included in the sample depends on characteristics related to the outcome of interest. Let $X$ denote input features, $Y$ denote the target variable, and $S$ denote a binary selection indicator where $S = 1$ means the observation is included in the sample. Selection bias is present when:
$$P(X, Y \mid S = 1) \neq P(X, Y)$$
In other words, the joint distribution of features and labels in the observed sample differs from the true population distribution. This inequality can stem from dependence between $S$ and $X$ (covariate shift), dependence between $S$ and $Y$ (outcome-dependent selection), or both.
A related formulation uses importance weighting. If each sample has a known selection probability $P(S = 1 \mid X)$, one can reweight observations by the inverse of this probability to recover unbiased population-level estimates:
$$\hat{\theta}_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i, Y_i)}{P(S = 1 \mid X_i)}$$
This is the basis of inverse propensity weighting (IPW), which plays a central role in both causal inference and bias correction for machine learning.
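As a concrete illustration, the following NumPy sketch (synthetic data, with the selection probabilities $P(S = 1 \mid X)$ assumed known exactly) shows how reweighting by the inverse selection probability recovers the population mean of $Y$ that a naive average over the selected sample misses. The normalized (Hajek) form of the estimator is used here for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full population: X ~ Uniform(0, 1), Y depends on X.
N = 100_000
X = rng.uniform(0, 1, N)
Y = 2.0 * X + rng.normal(0, 0.1, N)

# Selection favors large X: P(S=1 | X) = 0.1 + 0.8 * X  (assumed known here).
p_select = 0.1 + 0.8 * X
S = rng.uniform(0, 1, N) < p_select

# Naive estimate of E[Y] from the selected sample is biased upward.
naive = Y[S].mean()

# Inverse propensity weighting: weight each selected point by 1 / P(S=1 | X).
weights = 1.0 / p_select[S]
ipw = np.sum(weights * Y[S]) / np.sum(weights)   # normalized (Hajek) form

print(f"true mean      : {Y.mean():.3f}")
print(f"naive selected : {naive:.3f}")
print(f"IPW corrected  : {ipw:.3f}")
```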
Selection bias takes many distinct forms depending on how and where the non-representativeness enters the data pipeline.
Sampling bias occurs when certain members of the population are systematically more or less likely to be included in the dataset. This can happen through convenience sampling (collecting data from whatever sources are easiest to access), geographic concentration, or platform-specific data collection. For example, a sentiment analysis model trained exclusively on English-language Twitter posts will not generalize well to customer feedback submitted through formal email channels, because the language register, demographics, and topic distribution differ substantially.
Self-selection bias arises when individuals or entities choose whether to participate in a data-generating process. In online surveys, people who opt in may be systematically more engaged, more opinionated, or more technically literate than the general population. In machine learning, this appears when user-generated training data (such as product reviews or forum posts) overrepresents users who feel strongly about a topic, while the silent majority remains unobserved.
Survivorship bias occurs when the dataset includes only observations that have "survived" some selection process, while those that were filtered out, failed, or dropped off remain invisible. In finance, training a stock-picking model only on currently listed companies ignores all the companies that went bankrupt and were delisted. In healthcare, an AI chatbot trained to detect depression based on users who remain engaged over multiple sessions would miss patterns of severe depression, because severely depressed users tend to stop using the application. The classic illustration comes from World War II, when statistician Abraham Wald recommended reinforcing bomber aircraft in the areas where returning planes showed no damage, reasoning that planes hit in those areas never made it back.
Berkson's paradox, also called collider bias, is a specific form of selection bias that arises when sample inclusion depends on two or more variables, creating a spurious association between them. In causal inference terminology, a collider is a variable that is caused by two or more other variables. Conditioning on the collider (for example, by restricting the sample to observations where the collider takes a particular value) opens a non-causal path between its parent variables, producing a misleading correlation.
Joseph Berkson first described this in 1946 using a hospital-based study. Patients admitted to a hospital may have either diabetes or cholecystitis (gallbladder inflammation), and both conditions independently increase the probability of hospitalization. Among hospitalized patients, diabetes and cholecystitis appear negatively correlated, even though no such relationship exists in the general population. The hospital admission variable acts as a collider.
In machine learning contexts, Berkson's paradox can emerge whenever sample inclusion depends on a variable that is influenced by both the features and the outcome, for example when a dataset is restricted to admitted patients, hired applicants, or approved loan applications. The simulation below illustrates the effect with hypothetical numbers.
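This small simulation mirrors the hospital example: the two conditions are generated independently in the population, yet restricting attention to admitted patients induces a clear negative correlation. The prevalence and admission rates are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two independent conditions in the general population (hypothetical rates).
diabetes = (rng.random(n) < 0.08).astype(float)
cholecystitis = (rng.random(n) < 0.05).astype(float)

# Hospital admission (the collider): either condition raises admission probability.
p_admit = 0.02 + 0.40 * diabetes + 0.40 * cholecystitis
admitted = rng.random(n) < p_admit

corr_all = np.corrcoef(diabetes, cholecystitis)[0, 1]
corr_admitted = np.corrcoef(diabetes[admitted], cholecystitis[admitted])[0, 1]

print(f"correlation in full population : {corr_all:+.3f}")       # approximately zero
print(f"correlation among admitted     : {corr_admitted:+.3f}")  # clearly negative
```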
Attrition bias occurs when participants who drop out of a longitudinal study or stop contributing data are systematically different from those who remain. In clinical trials, sicker patients may withdraw because the treatment is not working, leaving only healthier patients in the dataset and making the treatment appear more effective than it actually is. In machine learning, attrition bias manifests when user churn is non-random. A recommendation system trained on long-term user interactions may overfit to the preferences of loyal users while failing to serve the needs of casual or dissatisfied users who left the platform.
Coverage bias occurs when certain segments of the target population are entirely absent from the data collection process. A facial recognition system trained on images scraped from social media in North America and Europe will underrepresent populations from Africa, South Asia, and other regions, leading to lower accuracy for those groups. Coverage bias differs from sampling bias in degree: while sampling bias involves underrepresentation, coverage bias involves complete exclusion.
Non-response bias arises when the individuals who do not provide data differ systematically from those who do. In survey-based data collection, individuals from certain demographic groups may be less likely to respond. When these non-responses correlate with the outcome variable, models trained on the collected data will produce skewed predictions. For instance, a customer satisfaction model built on survey responses may overestimate overall satisfaction because dissatisfied customers are less likely to respond.
Exclusion bias occurs during data preprocessing when certain records are systematically removed based on criteria that correlate with the outcome. Removing incomplete records (listwise deletion) can introduce bias if the missingness is not random. For example, if patients with severe conditions are more likely to have missing lab values because they were too ill to complete all tests, excluding those records biases the dataset toward milder cases.
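A small simulation (hypothetical numbers) makes the point: when the probability of a missing lab value increases with disease severity, listwise deletion shifts the observed sample toward milder cases.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical lab value: higher means more severe disease.
severity = rng.normal(0, 1, n)

# Missingness is NOT random: sicker patients are more likely to lack the lab value.
p_missing = 1 / (1 + np.exp(-2.0 * severity))   # probability rises with severity
observed = rng.random(n) > p_missing

print(f"true mean severity          : {severity.mean():+.3f}")            # ~ 0
print(f"mean after listwise deletion: {severity[observed].mean():+.3f}")  # biased low
```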
Time interval bias arises when the time window chosen for data collection does not represent the full temporal distribution of the phenomenon under study. Training a retail demand forecasting model only on data from the holiday season would produce predictions that overestimate baseline demand. Similarly, stopping a clinical trial early when interim results look promising can exaggerate the treatment effect.
| Type | Mechanism | Stage introduced | Example in ML |
|---|---|---|---|
| Sampling bias | Non-random inclusion from population | Data collection | Training on English-only web text for a multilingual model |
| Self-selection bias | Subjects choose to participate | Data collection | Learning from voluntary product reviews |
| Survivorship bias | Only "survivors" observed | Data collection/curation | Stock prediction trained on currently listed companies only |
| Berkson's paradox | Conditioning on a collider variable | Data collection/filtering | Hospital study creates spurious disease correlations |
| Attrition bias | Differential dropout over time | Longitudinal data | Users who churn are absent from training data |
| Coverage bias | Population segments entirely missing | Data collection | Facial recognition trained without certain demographics |
| Non-response bias | Differential response rates | Data collection | Surveys where dissatisfied users do not respond |
| Exclusion bias | Systematic removal during preprocessing | Data preprocessing | Dropping records with missing values |
| Time interval bias | Unrepresentative time window | Data collection | Training on holiday-season data for year-round prediction |
Selection bias is closely connected to problems in causal inference, particularly when analyzed through the framework of directed acyclic graphs (DAGs). A DAG represents causal relationships as directed edges between variables. Selection bias corresponds to conditioning on a collider or a descendant of a collider, which opens a spurious path between variables that are not causally related.
Consider the following causal structure:
X --> S <-- Y
Here, both $X$ and $Y$ influence selection $S$. In the full population, $X$ and $Y$ may be independent. But within the selected sample (where $S = 1$), $X$ and $Y$ become spuriously correlated. Any model trained on this selected sample will learn a relationship between $X$ and $Y$ that does not hold in the general population.
This DAG-based perspective clarifies why selection bias is so difficult to address with standard statistical techniques alone. If the selection mechanism is unknown or unobserved, no amount of post-hoc analysis on the selected sample can fully correct the bias. The only reliable solutions involve either collecting data from the full population, modeling the selection mechanism explicitly, or using instrumental variables that affect selection but not the outcome.
Selection bias is closely related to the machine learning concepts of covariate shift and domain adaptation. Covariate shift occurs when the input distribution changes between training and deployment, while the conditional distribution of outputs given inputs remains the same:
$$P_{\text{train}}(X) \neq P_{\text{test}}(X), \quad P_{\text{train}}(Y \mid X) = P_{\text{test}}(Y \mid X)$$
This is a specific form of selection bias where the selection depends only on the features $X$ and not on the label $Y$. When selection depends on $Y$ as well, the problem becomes more complex and falls under the broader category of dataset shift.
Domain adaptation methods attempt to bridge the gap between a source domain (training data) and a target domain (deployment environment) by learning representations that are invariant across domains or by reweighting source samples to match the target distribution. These techniques directly address the consequences of selection bias by explicitly acknowledging and correcting for distributional differences.
One of the most prominent demonstrations of selection bias in AI was the 2018 Gender Shades study conducted by Joy Buolamwini of the MIT Media Lab and Timnit Gebru. The study evaluated commercial gender classification systems from IBM, Microsoft, and Face++ using a balanced dataset of 1,270 faces with equal representation across gender and skin type. The results revealed stark disparities: error rates for darker-skinned women reached up to 34.7%, while error rates for lighter-skinned men were as low as 0.8%. The root cause was selection bias in training data. Standard computer vision datasets such as ImageNet drew their images primarily from web image search engines and photo-sharing platforms such as Flickr, which overrepresent males, lighter-skinned individuals, and adults between 18 and 40. Models trained on these datasets inherited these demographic imbalances, producing systems that worked well for the overrepresented groups and poorly for everyone else.
In 2018, Amazon discontinued an internal AI recruiting tool after discovering that it systematically penalized resumes from women. The system was trained on resumes submitted over a 10-year period, during which the majority of successful hires were men (reflecting the existing gender imbalance in the tech industry). The training data thus exhibited survivorship bias and self-selection bias: the model learned to prefer language patterns and credentials more common in male applicants, such as the word "executed." The algorithm penalized resumes that included the word "women's" (as in "women's chess club"), demonstrating how historical selection bias in human decision-making can become amplified when encoded into automated systems.
A widely cited 2019 study published in Science by Obermeyer et al. found that a commercial algorithm used to manage the health of approximately 200 million Americans exhibited significant racial bias. The algorithm used healthcare spending as a proxy for healthcare need, but because Black patients historically had less access to healthcare and consequently spent less, the algorithm systematically underestimated their health needs. At a given risk score, Black patients were considerably sicker than white patients with the same score. This is a case of selection bias compounded by proxy variable bias: the training data reflected the existing inequities in healthcare access rather than true health status.
The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system, used across the United States to predict recidivism risk, became the subject of intense debate after a 2016 ProPublica investigation found that the system produced higher false positive rates for Black defendants compared to white defendants. While the system's developer, Northpointe (now Equivant), argued that the tool satisfied a different definition of fairness, the controversy highlighted how selection bias in historical criminal justice data (which reflects decades of racially disparate policing and sentencing) can propagate into predictive models, perpetuating systemic inequities.
Large language models trained on web-scraped text inherit the selection biases present in online content. Web text overrepresents English speakers, younger demographics, users from wealthier countries, and individuals with internet access. Research by Hovy and Spruit (2016) and others has shown that NLP models trained on these corpora exhibit demographic biases, including lower performance on text produced by older adults, minority ethnic groups, and speakers of non-standard dialects. The Common Crawl corpus, a widely used training source, reflects the content and perspectives of websites that are heavily linked and easily crawled, systematically underrepresenting voices from communities with lower internet penetration.
Identifying selection bias in a dataset is often difficult because the bias may be invisible when the analyst only has access to the selected sample. Several approaches can help.
When auxiliary information about the target population is available (such as census data or known demographic distributions), one can compare the training data distribution against the population distribution. Statistical tests like the Kolmogorov-Smirnov (KS) test, chi-squared tests, and the maximum mean discrepancy (MMD) can quantify distributional differences across individual features or in aggregate.
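A minimal sketch of such a comparison is shown below, assuming a numeric feature (here called age) and a categorical feature with known population shares; the variable names and numbers are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical numeric feature (e.g. age) in the training sample vs. a reference
# population sample (e.g. drawn from census microdata).
train_age = rng.normal(32, 8, 5_000)         # training data skews younger
population_age = rng.normal(41, 14, 20_000)

ks_stat, ks_p = stats.ks_2samp(train_age, population_age)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.2e}")

# Hypothetical categorical feature (e.g. region) checked against known
# population proportions with a chi-squared goodness-of-fit test.
observed_counts = np.array([3_100, 1_200, 700])    # counts in the training data
population_props = np.array([0.45, 0.35, 0.20])    # known population shares
expected_counts = population_props * observed_counts.sum()

chi2, chi2_p = stats.chisquare(observed_counts, f_exp=expected_counts)
print(f"chi-squared = {chi2:.1f}, p-value = {chi2_p:.2e}")
```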
Evaluating model performance across demographic subgroups, geographic regions, time periods, or data sources can reveal patterns consistent with selection bias. If a model performs significantly worse on certain subgroups, this may indicate that those groups were underrepresented or absent from the training data.
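In practice this can be as simple as grouping predictions by a subgroup column and computing a metric per group, as in the hypothetical example below (per-group accuracy on a toy evaluation frame).

```python
import pandas as pd

# Hypothetical evaluation frame: true labels, model predictions, and a subgroup column.
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 0, 1, 0],
    "group":  ["A", "A", "B", "B", "B", "C", "C", "C"],
})

# Accuracy per subgroup; large gaps between groups are a warning sign that some
# groups were underrepresented (or absent) in the training data.
per_group_accuracy = (df["y_true"] == df["y_pred"]).groupby(df["group"]).mean()
print(per_group_accuracy)
```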
Examining patterns of missing data can reveal selection mechanisms. If missingness is correlated with the target variable or with sensitive attributes, this suggests that the observed data is a non-random subset of the full population. Little's MCAR test can formally test whether data is missing completely at random.
A practical technique for detecting covariate shift involves training a binary classifier to distinguish between training data and deployment data (or training data and a reference population sample). If the classifier achieves accuracy significantly above 50%, it indicates that the two distributions differ, suggesting selection bias. The features most predictive of the source domain can reveal which dimensions of the data are most affected.
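A sketch of this "domain classifier" check is shown below, using a random forest and cross-validated AUC on synthetic data; in practice the classifier's feature importances can also indicate which variables drive the shift.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def covariate_shift_score(X_train, X_deploy, n_splits=5):
    """Train a classifier to distinguish training rows from deployment rows.

    Cross-validated AUC well above 0.5 indicates the two feature distributions
    differ, a symptom of selection bias / covariate shift."""
    X = np.vstack([X_train, X_deploy])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_deploy))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=n_splits, scoring="roc_auc")
    return scores.mean()

# Example with synthetic data: deployment features are shifted relative to training.
rng = np.random.default_rng(4)
X_train = rng.normal(0.0, 1.0, size=(2_000, 5))
X_deploy = rng.normal(0.5, 1.0, size=(2_000, 5))
print(f"domain-classifier AUC: {covariate_shift_score(X_train, X_deploy):.3f}")
```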
The most effective way to address selection bias is to prevent it during data collection, using strategies such as probability-based random sampling, stratified sampling across known subgroups, drawing on multiple complementary data sources, and auditing coverage against the target population before training begins.
When collecting new data is not feasible, reweighting existing observations can partially correct for selection bias. Inverse propensity weighting (IPW) assigns each observation a weight equal to the inverse of its estimated selection probability. Observations that were less likely to be selected receive higher weights, effectively upsampling underrepresented portions of the population.
The selection probability (propensity score) can be estimated using logistic regression, random forests, or other classification methods. The key challenge is that the propensity model must be correctly specified. If it omits important variables that influence selection, the resulting weights will be inaccurate and may even increase bias.
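The sketch below extends the earlier example to the more realistic case where the selection probability must be estimated. It assumes covariates are available for the whole population while the outcome is observed only for selected units; the estimated inverse propensities (clipped to limit variance) are then passed as sample weights to a downstream regressor. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)

# Hypothetical setup: covariates X are known for the whole population, but the
# outcome y is only observed for the selected units (S == 1).
N = 20_000
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, N)
p_select = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0)))   # selection depends on X[:, 0]
S = rng.random(N) < p_select

# Stage 1: estimate the propensity P(S = 1 | X) with logistic regression.
propensity_model = LogisticRegression(max_iter=1_000).fit(X, S)
p_hat = propensity_model.predict_proba(X[S])[:, 1]

# Stage 2: train the outcome model on the selected data, weighting each row by
# the inverse propensity (clipped to avoid extreme, high-variance weights).
weights = np.clip(1.0 / p_hat, None, 20.0)
outcome_model = GradientBoostingRegressor(random_state=0)
outcome_model.fit(X[S], y[S], sample_weight=weights)
```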
| Method | Mechanism | Strengths | Limitations |
|---|---|---|---|
| Inverse propensity weighting | Reweight by inverse selection probability | Retains all data, well-understood theory | Sensitive to propensity model misspecification, high-variance weights |
| Heckman correction | Two-stage model with selection equation | Corrects for unobserved selection factors (under assumptions) | Requires normality assumption and exclusion restriction |
| SMOTE | Synthetic oversampling of minority class | Simple to implement, increases minority representation | Does not address the root cause of selection, may introduce noise |
| Stratified sampling | Proportional or balanced sampling by subgroup | Ensures representation across known groups | Requires knowledge of relevant strata before collection |
| Domain adaptation | Learn domain-invariant representations | Handles complex distributional shifts | Computationally expensive, may lose discriminative information |
| Data augmentation | Generate additional training examples | Increases effective sample size for underrepresented groups | Augmented data may not reflect true population variation |
The Heckman correction, developed by James Heckman in 1979, is a two-stage statistical method designed specifically for sample selection bias. In the first stage, a probit model estimates the probability of each observation being selected into the sample. From this model, the inverse Mills ratio is computed for each observation. In the second stage, the inverse Mills ratio is included as an additional covariate in the outcome regression, correcting for the bias introduced by non-random selection.
The Heckman correction requires two assumptions: (1) the errors in the selection and outcome equations are jointly normally distributed, and (2) an exclusion restriction exists, meaning at least one variable affects selection but not the outcome. When these assumptions hold, the method provides consistent estimates. When they are violated, the correction can inflate standard errors or produce misleading results.
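A rough two-step sketch using statsmodels is shown below. It assumes NumPy arrays, a boolean selection mask, and at least one selection-only variable for the exclusion restriction, and its second-stage standard errors ignore the uncertainty from the first stage, so it should be read as an illustration rather than a production implementation.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(X_outcome, X_selection, y, selected):
    """Two-step Heckman correction sketch.

    X_selection should contain at least one variable excluded from X_outcome
    (the exclusion restriction); `selected` is a boolean mask over all rows."""
    # Stage 1: probit model for the selection indicator.
    probit = sm.Probit(selected.astype(float), sm.add_constant(X_selection)).fit(disp=0)
    xb = probit.fittedvalues                      # linear predictor X'gamma
    inverse_mills = norm.pdf(xb) / norm.cdf(xb)   # inverse Mills ratio

    # Stage 2: outcome regression on the selected sample, with the inverse
    # Mills ratio added as an extra regressor to absorb the selection effect.
    X2 = sm.add_constant(np.column_stack([X_outcome[selected],
                                          inverse_mills[selected]]))
    return sm.OLS(y[selected], X2).fit()
```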
Cross-validation helps detect selection bias by evaluating model performance across multiple data splits. If performance varies significantly across folds, this may indicate that certain subsets of the data have different distributional properties. Techniques like leave-one-group-out cross-validation (where each fold corresponds to a different subgroup, time period, or data source) are particularly useful for assessing whether a model generalizes beyond the specific selection patterns in the training data.
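Below is a minimal scikit-learn example of leave-one-group-out evaluation, with a synthetic group label standing in for hospital, region, or collection period.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Hypothetical data where each row belongs to a group (e.g. hospital, region,
# or collection period); the question is whether the model generalizes to
# groups it has never seen.
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(0, 1, 600) > 0).astype(int)
groups = rng.integers(0, 5, 600)   # five data sources

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y,
                         cv=logo, groups=groups)
print("per-group held-out accuracy:", np.round(scores, 3))
# Large variation across folds suggests group-specific selection patterns.
```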
Adversarial debiasing uses a two-network architecture in which a primary model learns to make predictions while an adversarial network simultaneously tries to predict sensitive attributes (such as demographic group membership) from the primary model's outputs. The primary model is trained to minimize prediction error while also minimizing the adversary's ability to detect group membership, encouraging the model to learn representations that are less dependent on the biased selection patterns in the training data.
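A heavily simplified PyTorch sketch of the alternating training loop is shown below: the adversary learns to recover group membership from the predictor's outputs, and the predictor is penalized whenever it succeeds. Published adversarial-debiasing methods differ in details (for example, using gradient projection or feeding the adversary internal representations rather than outputs), so this is a schematic rather than a faithful reimplementation of any specific system.

```python
import torch
import torch.nn as nn

# Toy dimensions (hypothetical): d input features, binary task label, binary group.
d = 16
predictor = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
lam = 1.0   # strength of the debiasing penalty (a tuning knob)

def training_step(x, y, group):
    """One alternating update: the adversary tries to predict the group from
    the predictor's output; the predictor is rewarded for thwarting it."""
    # Update the adversary on detached predictions.
    logits = predictor(x).detach()
    adv_loss = bce(adversary(logits), group)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # Update the predictor: task loss minus the adversary's loss, so making
    # group membership hard to recover lowers the predictor's objective.
    logits = predictor(x)
    pred_loss = bce(logits, y) - lam * bce(adversary(logits), group)
    opt_pred.zero_grad()
    pred_loss.backward()
    opt_pred.step()
    return pred_loss.item()

# Example call with random tensors, purely to show the expected shapes.
x = torch.randn(64, d)
y = torch.randint(0, 2, (64, 1)).float()
group = torch.randint(0, 2, (64, 1)).float()
training_step(x, y, group)
```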
When the selection mechanism can be represented as a DAG, causal modeling techniques can identify and correct for selection bias. By specifying which variables influence selection and which influence the outcome, researchers can determine which adjustments are necessary and which would introduce additional bias. Instrumental variable methods, which exploit variables that affect selection but not the outcome, provide another avenue for correcting selection bias when direct adjustment is insufficient.
Selection bias is one of several types of bias that can affect machine learning systems. Understanding how it relates to other biases helps practitioners identify and address the correct problem.
| Bias type | Definition | Cause | Relationship to selection bias |
|---|---|---|---|
| Selection bias | Non-representative sample | Data collection or filtering process | (this article) |
| Confirmation bias | Tendency to seek evidence supporting existing beliefs | Human judgment in data collection or interpretation | Can cause selection bias when data collectors favor confirming data |
| Measurement bias | Systematic error in how variables are recorded | Faulty instruments or inconsistent protocols | Can co-occur with selection bias but involves measurement, not sampling |
| Label bias | Errors or subjectivity in annotation | Annotator subjectivity, unclear guidelines | Distinct from selection bias but can compound its effects |
| Algorithmic bias | Model produces systematically unfair outputs | Biased data, model design, or objective function | Often a downstream consequence of selection bias in training data |
| Reporting bias | Selective reporting of results | Publication incentives, positive result bias | A form of selection bias applied to research findings rather than data |
| Inductive bias | Assumptions built into the learning algorithm | Model architecture and design choices | Unrelated to data selection; concerns model structure |
Selection bias has direct consequences for the fairness of machine learning systems. When certain demographic groups are underrepresented or misrepresented in training data, models trained on that data will produce less accurate predictions for those groups, which can lead to violations of formal fairness criteria such as demographic parity, equalized odds, and calibration across groups.
Beyond technical fairness metrics, selection bias raises broader ethical concerns. Automated systems that perpetuate historical inequities (as in the COMPAS and healthcare examples above) can entrench discrimination, reduce accountability, and erode public trust in AI systems. Regulatory frameworks like the EU AI Act increasingly require organizations to demonstrate that their AI systems do not exhibit unjustified bias, making the detection and correction of selection bias a legal as well as ethical obligation.