# Selection Bias

> Source: https://aiwiki.ai/wiki/selection_bias
> Updated: 2026-06-23
> Categories: AI Ethics, Data & Datasets, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Selection bias** is a systematic error that occurs when the data used for analysis, training, or evaluation does not accurately represent the population or domain it is intended to describe, because the process that selected the sample was not random. In [machine learning](/wiki/machine_learning), selection bias arises when the process of collecting, filtering, or curating [training data](/wiki/training_set) produces a sample that differs systematically from the target distribution. Models trained on biased samples tend to learn distorted patterns, resulting in poor [generalization](/wiki/generalization), unfair predictions, and unreliable performance when deployed in real-world settings. The most famous illustration is the 1936 Literary Digest poll, which surveyed about 2.27 million people yet wrongly predicted the U.S. presidential election because its sample was drawn from car and telephone owners who skewed wealthier than the electorate.[14]

The concept has deep roots in statistics and epidemiology. Joseph Berkson first described a form of selection bias in hospital-based studies in 1946, showing how conditioning on hospital admission could create spurious associations between diseases.[1] Decades later, economist James Heckman developed formal methods for correcting sample selection bias in econometric models, work that earned him the Nobel Memorial Prize in Economic Sciences in 2000.[2] As machine learning systems have become more widespread, selection bias has emerged as one of the most common and consequential sources of error in data-driven decision-making, where it is a primary driver of [AI bias](/wiki/ai_bias) and [algorithmic unfairness](/wiki/algorithmic_fairness).

## Explain like I'm 5 (ELI5)

Imagine you want to find out what flavor of ice cream kids like best. But you only ask the kids at a birthday party where chocolate ice cream is being served. Most of them will say "chocolate!" because that is what they are eating right now. You would think everyone loves chocolate the most, but you missed all the kids at home who might prefer vanilla or strawberry. Selection bias is like that: when you only look at part of the picture, you get the wrong answer because your sample is not a fair representation of everybody.

## Formal definition

In statistical terms, selection bias occurs when the probability of an observation being included in the sample depends on characteristics related to the outcome of interest. Let $X$ denote input features, $Y$ denote the target variable, and $S$ denote a binary selection indicator where $S = 1$ means the observation is included in the sample. Selection bias is present when:

$$P(X, Y \mid S = 1) \neq P(X, Y)$$

In other words, the joint distribution of features and labels in the observed sample differs from the true population distribution. This inequality can stem from dependence between $S$ and $X$ (covariate shift), dependence between $S$ and $Y$ (outcome-dependent selection), or both.

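A related formulation uses importance weighting. If each unit has a known, strictly positive selection probability $P(S = 1 \mid X)$, and selection depends on the outcome only through $X$, one can reweight the selected observations by the inverse of this probability to recover an unbiased estimate of the population mean $\theta = \mathbb{E}[f(X, Y)]$. With $N$ the population size and the sum running over the $n$ selected observations:

$$\hat{\theta}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{n} \frac{f(X_i, Y_i)}{P(S = 1 \mid X_i)}$$

When $N$ is unknown, it is commonly replaced by the sum of the weights $\sum_{i=1}^{n} 1 / P(S = 1 \mid X_i)$, which gives a self-normalized estimator.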

This is the basis of inverse propensity weighting (IPW), which plays a central role in both causal inference and bias correction for machine learning.[6][10]
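The reweighting idea can be illustrated with a small simulation. This is a minimal sketch on made-up data, not a reference implementation: the population, the selection rule in `propensity`, and the function names `ipw_mean` and `hajek_mean` are all assumptions introduced here. It divides the weighted sum by the population size rather than by the number of selected rows, and also shows the self-normalized variant that divides by the sum of the weights.

```python
import random

def ipw_mean(values, propensities, population_size):
    """Inverse-propensity-weighted estimate of a population mean."""
    return sum(v / p for v, p in zip(values, propensities)) / population_size

def hajek_mean(values, propensities):
    """Self-normalized variant: divide by the sum of the weights instead of N."""
    weights = [1.0 / p for p in propensities]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

rng = random.Random(0)

# Hypothetical population: X is binary and Y is much larger when X = 1.
population = []
for _ in range(100_000):
    x = 1 if rng.random() < 0.5 else 0
    y = 10.0 * x + rng.gauss(0.0, 1.0)
    population.append((x, y))

def propensity(x):
    """Assumed known selection rule: units with x = 1 are kept 9 times as often."""
    return 0.9 if x == 1 else 0.1

# Selection depends on X only, so P(S = 1 | X) is the true inclusion probability.
sample = [(x, y) for x, y in population if rng.random() < propensity(x)]
ys = [y for _, y in sample]
ps = [propensity(x) for x, _ in sample]

true_mean = sum(y for _, y in population) / len(population)
naive_mean = sum(ys) / len(ys)
ht = ipw_mean(ys, ps, len(population))
hajek = hajek_mean(ys, ps)
print(f"true={true_mean:.2f} naive={naive_mean:.2f} ipw={ht:.2f} hajek={hajek:.2f}")
```

In this toy setting the unweighted sample mean is pulled strongly toward the over-selected group, while both weighted estimates land close to the population mean. The correction works here only because the selection probabilities are known exactly.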

## What is the classic example of selection bias? The 1936 Literary Digest poll

The single most cited example of selection bias is the 1936 United States presidential election forecast by The Literary Digest. The magazine mailed roughly 10 million sample ballots and tabulated about 2.27 million returned responses, an enormous sample for its era. On the strength of those numbers it predicted that Republican Alf Landon would defeat the incumbent Franklin D. Roosevelt with 57 percent of the popular vote to 43 percent.[14] The actual result was almost the reverse: Roosevelt won about 61 percent of the popular vote (roughly 62 percent of the two-party vote) and 523 electoral votes, against about 37 percent and 8 electoral votes for Landon, one of the largest landslides in U.S. history.[14]

The failure was not caused by too small a sample but by a biased one. The Digest assembled its mailing list from automobile registrations, telephone directories, and its own subscriber rolls, sources that in the depths of the Great Depression skewed heavily toward wealthier households who were both more likely to own cars and phones and more likely to oppose Roosevelt's New Deal. Non-response compounded the problem: the people who chose to mail a ballot back differed systematically from those who did not. In the same election, George Gallup correctly forecast a Roosevelt victory using a far smaller but more representative sample of roughly 50,000 respondents, a result that helped establish modern probability-based polling and effectively ended The Literary Digest, which folded in 1938.[14] The episode is the canonical demonstration that representativeness, not raw sample size, is what protects a study from selection bias.
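The point that representativeness matters more than raw sample size can be checked with a toy simulation. The sketch below does not use the 1936 data: the electorate, the support rates, and the inclusion probabilities are invented for illustration, and `draw_voter`, `biased_frame_sample`, and `random_sample` are hypothetical helper names.

```python
import random

rng = random.Random(1936)

def draw_voter():
    """One voter from a hypothetical electorate (all rates are made up)."""
    wealthy = rng.random() < 0.25
    supports_a = rng.random() < (0.30 if wealthy else 0.70)
    return wealthy, supports_a

def biased_frame_sample(size):
    """Keep wealthy voters with probability 0.8 and others with probability 0.1,
    loosely mimicking a mailing list built from car and telephone records."""
    kept = []
    while len(kept) < size:
        wealthy, supports_a = draw_voter()
        if rng.random() < (0.8 if wealthy else 0.1):
            kept.append(supports_a)
    return kept

def random_sample(size):
    """Simple random sample: every voter is equally likely to be included."""
    return [draw_voter()[1] for _ in range(size)]

big_biased = biased_frame_sample(100_000)
small_random = random_sample(1_000)

# True support for candidate A is 0.25 * 0.30 + 0.75 * 0.70 = 0.60.
share_biased = sum(big_biased) / len(big_biased)
share_random = sum(small_random) / len(small_random)
print(f"biased sample (n=100000): {share_biased:.3f}")
print(f"random sample (n=1000):   {share_random:.3f}")
```

Under these assumptions the large sample settles confidently on the wrong answer, because adding respondents from the same skewed frame shrinks the variance but not the bias. The random sample is a hundred times smaller and noisier, yet it is centered on the true value.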

## What are the main types of selection bias?

Selection bias takes many distinct forms depending on how and where the non-representativeness enters the data pipeline.

### Sampling bias

[Sampling bias](/wiki/sampling_bias) occurs when certain members of the population are systematically more or less likely to be included in the dataset. This can happen through convenience sampling (collecting data from whatever sources are easiest to access), geographic concentration, or platform-specific data collection. For example, a [sentiment analysis](/wiki/sentiment_analysis) model trained exclusively on English-language Twitter posts will not generalize well to customer feedback submitted through formal email channels, because the language register, demographics, and topic distribution differ substantially.

### Self-selection bias

Self-selection bias arises when individuals or entities choose whether to participate in a data-generating process. In online surveys, people who opt in may be systematically more engaged, more opinionated, or more technically literate than the general population. In machine learning, this appears when user-generated training data (such as product reviews or forum posts) overrepresents users who feel strongly about a topic, while the silent majority remains unobserved.

### Survivorship bias

Survivorship bias occurs when the dataset includes only observations that have "survived" some selection process, while those that were filtered out, failed, or dropped off remain invisible. In finance, training a stock-picking model only on currently listed companies ignores all the companies that went bankrupt and were delisted. In healthcare, an AI chatbot trained to detect depression based on users who remain engaged over multiple sessions would miss patterns of severe depression, because severely depressed users tend to stop using the application.
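The finance example can be sketched in a few lines. The return distribution and the delisting threshold below are assumptions chosen for illustration, not empirical figures.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical universe of 5,000 companies whose annual returns all come from
# the same distribution; a company is treated as delisted below -30%.
returns = [rng.gauss(0.05, 0.30) for _ in range(5_000)]
survivors = [r for r in returns if r > -0.30]

mean_all = statistics.fmean(returns)
mean_survivors = statistics.fmean(survivors)
print(f"all companies:  {mean_all:.3f}")
print(f"survivors only: {mean_survivors:.3f}")
```

Because the filter removes only the worst outcomes, the survivors' average return is necessarily higher than the full universe's. A model fitted to survivors alone would inherit that optimism.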

The classic illustration comes from World War II. The mathematician Abraham Wald, working with the Statistical Research Group (SRG) at Columbia University, was asked in 1943 how to add armor to bomber aircraft to reduce losses. The military had mapped the bullet holes on planes that returned from missions and proposed reinforcing the most-hit areas, such as the wings and fuselage. Wald reasoned the opposite: the returning planes were the survivors, so the holes they carried marked hits an aircraft could absorb and still come home. The areas with few holes on survivors, including the engines, were precisely where a hit was likely fatal, because planes struck there never returned to be counted. He recommended armoring the parts that showed the least damage on returning aircraft. Wald's analysis was published as a series of SRG memoranda and was later reissued by the Center for Naval Analyses; his reasoning continued to inform aircraft survivability work through the Korean and Vietnam wars.[15]

### Berkson's paradox (collider bias)

Berkson's paradox, also called collider bias, is a specific form of selection bias that arises when sample inclusion depends on two or more variables, creating a spurious association between them. In [causal inference](/wiki/causal_inference) terminology, a collider is a variable that is caused by two or more other variables. Conditioning on the collider (for example, by restricting the sample to observations where the collider takes a particular value) opens a non-causal path between its parent variables, producing a misleading correlation.[11]

Joseph Berkson first described this in 1946 using a hospital-based study. Patients admitted to a hospital may have either diabetes or cholecystitis (gallbladder inflammation), and both conditions independently increase the probability of hospitalization. Among hospitalized patients, diabetes and cholecystitis appear negatively correlated, even though no such relationship exists in the general population. The hospital admission variable acts as a collider.[1]

In machine learning contexts, Berkson's paradox can emerge in several ways:

- A hiring model trained on applicants who were either highly educated or highly experienced (but rarely both) might incorrectly learn a negative correlation between education and experience.
- A content recommendation system trained on viral posts (high reach but low engagement) and niche posts (low reach but deep engagement) might falsely conclude that reach and engagement are inversely related.
- A loan default model trained on approved applicants (who were selected for having either high income or high credit score) could learn a spurious negative relationship between income and credit score.

### Attrition bias

Attrition bias occurs when participants who drop out of a longitudinal study or stop contributing data are systematically different from those who remain. In clinical trials, sicker patients may withdraw because the treatment is not working, leaving only healthier patients in the dataset and making the treatment appear more effective than it actually is. In machine learning, attrition bias manifests when user churn is non-random. A [recommendation system](/wiki/recommender_system) trained on long-term user interactions may overfit to the preferences of loyal users while failing to serve the needs of casual or dissatisfied users who left the platform.

### Coverage bias

Coverage bias occurs when certain segments of the target population are entirely absent from the data collection process, typically because the sampling frame never reaches them. A facial recognition system trained only on images scraped from social media in North America and Europe may contain few or no faces from parts of Africa, South Asia, and other regions, leading to lower accuracy for those groups. Coverage bias differs from sampling bias in degree: sampling bias involves underrepresentation, whereas coverage bias in its strict form involves complete exclusion.

### Non-response bias

Non-response bias arises when the individuals who do not provide data differ systematically from those who do. In survey-based data collection, individuals from certain demographic groups may be less likely to respond. When these non-responses correlate with the outcome variable, models trained on the collected data will produce skewed predictions. For instance, a customer satisfaction model built on survey responses may overestimate overall satisfaction because dissatisfied customers are less likely to respond.

### Exclusion bias

Exclusion bias occurs during data preprocessing when certain records are systematically removed based on criteria that correlate with the outcome. Removing incomplete records (listwise deletion) can introduce bias if the missingness is not random. For example, if patients with severe conditions are more likely to have missing lab values because they were too ill to complete all tests, excluding those records biases the dataset toward milder cases.

### Time interval bias

Time interval bias arises when the time window chosen for data collection does not represent the full temporal distribution of the phenomenon under study. Training a retail demand forecasting model only on data from the holiday season would produce predictions that overestimate baseline demand. Similarly, stopping a clinical trial early when interim results look promising can exaggerate the treatment effect.

## Comparison of selection bias types

| Type | Mechanism | Stage introduced | Example in ML |
|---|---|---|---|
| [Sampling bias](/wiki/sampling_bias) | Non-random inclusion from population | Data collection | Training on English-only web text for a multilingual model |
| Self-selection bias | Subjects choose to participate | Data collection | Learning from voluntary product reviews |
| Survivorship bias | Only "survivors" observed | Data collection/curation | Stock prediction trained on currently listed companies only |
| Berkson's paradox | Conditioning on a collider variable | Data collection/filtering | Loan default model trained only on approved applicants learns a spurious income/credit-score correlation |
| [Attrition bias](/wiki/attrition_bias) | Differential dropout over time | Longitudinal data | Users who churn are absent from training data |
| [Coverage bias](/wiki/coverage_bias) | Population segments entirely missing | Data collection | [Facial recognition](/wiki/image_recognition) trained without certain demographics |
| Non-response bias | Differential response rates | Data collection | Surveys where dissatisfied users do not respond |
| Exclusion bias | Systematic removal during preprocessing | Data preprocessing | Dropping records with missing values |
| Time interval bias | Unrepresentative time window | Data collection | Training on holiday-season data for year-round prediction |

## Selection bias and causal inference

Selection bias is closely connected to problems in [causal inference](/wiki/causal_inference), particularly when analyzed through the framework of directed acyclic graphs (DAGs). A DAG represents causal relationships as directed edges between variables. Selection bias corresponds to conditioning on a collider or a descendant of a collider, which opens a spurious path between variables that are not causally related.[11]

Consider the following causal structure:

```
X --> S <-- Y
```

Here, both $X$ and $Y$ influence selection $S$. In the full population, $X$ and $Y$ may be independent. But within the selected sample (where $S = 1$), $X$ and $Y$ become spuriously correlated. Any model trained on this selected sample will learn a relationship between $X$ and $Y$ that does not hold in the general population.
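A short simulation makes the collider effect concrete. It is a sketch under stated assumptions: $X$ and $Y$ are drawn as independent standard normal variables, and the selection rule (keep a row when $X + Y > 1$) is a hypothetical choice made only so that both variables influence $S$.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(42)
n = 50_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generated independently of x

# Hypothetical selection rule: S = 1 when x + y exceeds 1.
selected = [(a, b) for a, b in zip(x, y) if a + b > 1.0]
x_sel = [a for a, _ in selected]
y_sel = [b for _, b in selected]

corr_population = pearson(x, y)
corr_selected = pearson(x_sel, y_sel)
print(f"population correlation: {corr_population:+.3f}")
print(f"selected correlation:   {corr_selected:+.3f}")
```

In the full population the correlation is close to zero, while within the selected rows it is clearly negative. Among rows that cleared the threshold, a low $X$ implies that $Y$ must have been high, and vice versa.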

This DAG-based perspective clarifies why selection bias is so difficult to address with standard statistical techniques alone. If the selection mechanism is unknown or unobserved, no amount of post-hoc analysis on the selected sample can fully correct the bias. The reliable remedies are to collect data from the full population, to model the selection mechanism explicitly, or to use instrumental variables that affect selection but not the outcome.

## How does selection bias relate to covariate shift and domain adaptation?

Selection bias is closely related to the machine learning concepts of [covariate shift](/wiki/covariate_shift) and domain adaptation. Covariate shift occurs when the input distribution changes between training and deployment, while the conditional distribution of outputs given inputs remains the same:

$$P_{\text{train}}(X) \neq P_{\text{test}}(X), \quad P_{\text{train}}(Y \mid X) = P_{\text{test}}(Y \mid X)$$

This is a specific form of selection bias where the selection depends only on the features $X$ and not on the label $Y$.[7] When selection depends on $Y$ as well, the problem becomes more complex and falls under the broader category of dataset shift.
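Under pure covariate shift, a training-set average can be corrected by weighting each example with the density ratio $P_{\text{test}}(x) / P_{\text{train}}(x)$. The sketch below assumes both input densities are known normal distributions, which is rarely true in practice (the ratio usually has to be estimated). The `target` function is a hypothetical stand-in for any per-example quantity, such as a loss.

```python
import random
from statistics import NormalDist

rng = random.Random(0)

# Assumed input densities: training inputs ~ N(0, 1), deployment inputs ~ N(1, 1).
p_train = NormalDist(mu=0.0, sigma=1.0)
p_test = NormalDist(mu=1.0, sigma=1.0)

def target(x):
    """Quantity to average; its dependence on x is the same in both domains."""
    return x * x

train_x = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
weights = [p_test.pdf(x) / p_train.pdf(x) for x in train_x]

unweighted = sum(target(x) for x in train_x) / len(train_x)
weighted = sum(w * target(x) for w, x in zip(weights, train_x)) / sum(weights)

# Analytically, E[x^2] is 1 under N(0, 1) and 2 under N(1, 1).
print(f"unweighted training average:  {unweighted:.3f}")
print(f"importance-weighted average:  {weighted:.3f}")
```

The weighted average approximates the deployment-domain expectation using training inputs only. The weights become very uneven when the two densities barely overlap, which is why this correction tends to have high variance under large shifts.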

[Domain adaptation](/wiki/domain_adaptation) methods attempt to bridge the gap between a source domain (training data) and a target domain (deployment environment) by learning representations that are invariant across domains or by reweighting source samples to match the target distribution. These techniques directly address the consequences of selection bias by explicitly acknowledging and correcting for distributional differences.

## Real-world case studies

### Facial recognition and the Gender Shades study

One of the most prominent demonstrations of selection bias in AI was the 2018 Gender Shades study by Joy Buolamwini of the MIT Media Lab and Timnit Gebru. The study evaluated commercial gender classification systems from IBM, Microsoft, and Face++ using a balanced dataset of 1,270 faces with equal representation across gender and skin type. The results revealed stark disparities: error rates for darker-skinned women reached up to 34.7%, while error rates for lighter-skinned men were as low as 0.8%.[3] The disparities are widely attributed to selection bias in training and benchmark data. Standard [computer vision](/wiki/computer_vision) datasets like [ImageNet](/wiki/imagenet) drew their images primarily from web search engines and photo-sharing sites such as Flickr, which overrepresent males, lighter-skinned individuals, and adults between 18 and 40. Models trained on these datasets inherited these demographic imbalances, producing systems that worked well for the overrepresented groups and poorly for everyone else.

### Amazon's hiring algorithm

In 2018, Reuters reported that Amazon had abandoned an internal AI recruiting tool after discovering that it systematically penalized resumes from women. The system was trained on resumes submitted over a 10-year period, during which the majority of successful hires were men (reflecting the existing gender imbalance in the tech industry). The training data thus exhibited survivorship bias and self-selection bias: the model learned to prefer language patterns and credentials more common in male applicants, such as the verb "executed." It also penalized resumes that included the word "women's" (as in "women's chess club"). Amazon ultimately concluded it could not guarantee the tool would be gender-neutral. The case demonstrates how historical selection bias in human decision-making can become amplified when encoded into automated systems.[12]

### Healthcare risk prediction

A widely cited 2019 study published in Science by Obermeyer et al. found that a commercial algorithm used to manage the health of approximately 200 million Americans exhibited significant racial bias. The algorithm used healthcare spending as a proxy for healthcare need, but because Black patients historically had less access to healthcare and consequently spent less, the algorithm systematically underestimated their health needs. At a given risk score, Black patients were considerably sicker than white patients with the same score.[4] The authors calculated that correcting the bias would raise the share of Black patients flagged for extra care from 17.7 percent to 46.5 percent, and the algorithm's manufacturer confirmed the disparity on a national dataset of more than 3.5 million patients.[16] This is a case of selection bias compounded by proxy variable bias: the training data reflected the existing inequities in healthcare access rather than true health status.

### COMPAS recidivism prediction

The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system, used across the United States to predict recidivism risk, became the subject of intense debate after a 2016 ProPublica investigation. Analyzing risk scores for more than 7,000 defendants arrested in Broward County, Florida, ProPublica reported that the false positive rate (defendants labeled high-risk who did not reoffend) was 44.9 percent for Black defendants versus 23.5 percent for white defendants, nearly double.[5][17] While the system's developer, Northpointe (now Equivant), argued that the tool satisfied a different definition of [fairness](/wiki/fairness_metric) (equal predictive accuracy across groups), the controversy highlighted how selection bias in historical criminal justice data (which reflects decades of racially disparate policing and sentencing) can propagate into predictive models, perpetuating systemic inequities.

### NLP and web-scraped training corpora

[Large language models](/wiki/large_language_model) trained on web-scraped text inherit the selection biases present in online content. Web text overrepresents English speakers, younger demographics, users from wealthier countries, and individuals with internet access. Research by Hovy and Spruit (2016) and others has shown that [NLP](/wiki/natural_language_understanding) models trained on these corpora exhibit demographic biases, including lower performance on text produced by older adults, minority ethnic groups, and speakers of non-standard dialects.[9] The Common Crawl corpus, a widely used training source, reflects the content and perspectives of websites that are well-indexed by search engines, systematically underrepresenting voices from communities with lower internet penetration.

## How is selection bias detected?

Identifying selection bias in a dataset is often more difficult than correcting it, because the bias may be invisible if the analyst only has access to the selected sample. Several approaches can help.

### Comparing distributions

When auxiliary information about the target population is available (such as census data or known demographic distributions), one can compare the training data distribution against the population distribution. Statistical tests like the Kolmogorov-Smirnov (KS) test, chi-squared tests, and the maximum mean discrepancy (MMD) can quantify distributional differences across individual features or in aggregate.
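As a minimal sketch of such a comparison, the snippet below computes a chi-squared goodness-of-fit statistic for one categorical feature. The age bands, census shares, and sample counts are invented for illustration, and `chi_squared_gof` is a hypothetical helper. The closed-form p-value used here is valid only for two degrees of freedom (three categories); a statistics library would be needed for the general case.

```python
import math

def chi_squared_gof(observed_counts, expected_shares):
    """Chi-squared goodness-of-fit statistic of counts against reference shares."""
    n = sum(observed_counts.values())
    stat = 0.0
    for category, share in expected_shares.items():
        expected = n * share
        stat += (observed_counts.get(category, 0) - expected) ** 2 / expected
    return stat

# Hypothetical reference shares (e.g. from a census) and training-set counts.
census_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
sample_counts = {"18-34": 450, "35-54": 350, "55+": 200}

stat = chi_squared_gof(sample_counts, census_shares)

# Three categories give 2 degrees of freedom, for which the chi-squared
# survival function is exactly exp(-x / 2).
p_value = math.exp(-stat / 2.0)
print(f"chi-squared = {stat:.2f}, p = {p_value:.3g}")
```

A very small p-value indicates that the sample's age mix is unlikely to have arisen from random sampling of the reference population. It does not say which selection mechanism produced the gap.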

### Subgroup performance analysis

Evaluating model performance across demographic subgroups, geographic regions, time periods, or data sources can reveal patterns consistent with selection bias. If a model performs significantly worse on certain subgroups, this may indicate that those groups were underrepresented or absent from the training data.
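A per-group breakdown takes only a few lines. The labels, predictions, and group names below are made-up toy data, and `accuracy_by_group` is a hypothetical helper name.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Toy data: group A has 8 evaluation rows, group B only 4.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.3f}")
print(per_group)
```

The overall figure (0.75) hides the gap between group A (0.875) and group B (0.5). With so few rows in group B the estimate is also noisy, so small subgroups call for confidence intervals in a real audit.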

### Missing data analysis

Examining patterns of missing data can reveal selection mechanisms. If missingness is correlated with the target variable or with sensitive attributes, this suggests that the observed data is a non-random subset of the full population. Little's MCAR test can formally test whether data is missing completely at random.

### Domain classifier method

A practical technique for detecting covariate shift involves training a binary classifier to distinguish between training data and deployment data (or training data and a reference population sample). If the classifier's held-out accuracy is significantly above chance (50% when the two sets are equally sized; AUC is a more robust measure when they are not), the two distributions differ, which suggests selection bias. The features most predictive of the source domain can reveal which dimensions of the data are most affected.
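The domain classifier idea can be sketched with a one-feature logistic regression written from scratch. Everything here is an assumption made for illustration: the two synthetic domains, the learning rate, the number of epochs, and the helper names `sigmoid` and `fit_logistic`. In practice an off-the-shelf classifier and cross-validated AUC would be the more usual choice.

```python
import math
import random

rng = random.Random(3)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, lr=0.5, epochs=200):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, d in zip(xs, labels):
            err = sigmoid(w * x + b) - d
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Synthetic domains of equal size: label 0 = training data, 1 = deployment data.
train_domain = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
deploy_domain = [rng.gauss(1.0, 1.0) for _ in range(2_000)]
rows = [(x, 0) for x in train_domain] + [(x, 1) for x in deploy_domain]
rng.shuffle(rows)

# Hold out 25% of the rows to measure how separable the domains are.
split = int(0.75 * len(rows))
fit_rows, eval_rows = rows[:split], rows[split:]
w, b = fit_logistic([x for x, _ in fit_rows], [d for _, d in fit_rows])

correct = sum((sigmoid(w * x + b) >= 0.5) == (d == 1) for x, d in eval_rows)
accuracy = correct / len(eval_rows)
print(f"domain classifier held-out accuracy: {accuracy:.3f}")
```

Held-out accuracy well above 0.5 on equally sized domains signals that the feature distribution differs between them. With many features, the fitted coefficients (or feature importances) point to the dimensions driving the difference.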

## How can selection bias be mitigated?

### Data collection strategies

The most effective way to address selection bias is to prevent it during data collection. Strategies include:

- **Probability sampling:** Using simple random sampling, stratified sampling, or cluster sampling to ensure each member of the population has a known, non-zero probability of inclusion.
- **Stratified collection:** Deliberately oversampling underrepresented groups to ensure adequate coverage across all relevant subpopulations.
- **Active data acquisition:** Using [active learning](/wiki/active_learning) techniques to identify and prioritize the collection of data points from underrepresented regions of the feature space.
- **Multiple data sources:** Combining data from diverse sources (surveys, administrative records, sensor data) to reduce the impact of any single source's selection mechanism.
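Balanced stratified collection, as in the second item above, can be sketched as follows. The record layout, the `region` field, and the `stratified_sample` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, per_stratum, rng):
    """Draw up to `per_stratum` records from every stratum (balanced allocation)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: 900 urban records and 100 rural records.
records = [{"id": i, "region": "urban" if i < 900 else "rural"} for i in range(1000)]

sample = stratified_sample(
    records, key=lambda r: r["region"], per_stratum=50, rng=random.Random(0)
)
counts = Counter(r["region"] for r in sample)
print(counts)
```

Each stratum contributes 50 records even though rural records make up only 10% of the population. Because this deliberately oversamples the smaller group, population-level estimates computed from such a sample need weights that restore the original proportions.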

### Reweighting and importance sampling

When collecting new data is not feasible, reweighting existing observations can partially correct for selection bias.[10] [Inverse propensity weighting](/wiki/inverse_probability_weighting) (IPW) assigns each observation a weight equal to the inverse of its estimated selection probability. Observations that were less likely to be selected receive higher weights, effectively upsampling underrepresented portions of the population.

The selection probability (propensity score) can be estimated using [logistic regression](/wiki/logistic_regression), [random forests](/wiki/random_forest), or other classification methods. The key challenge is that the propensity model must be correctly specified. If it omits important variables that influence selection, the resulting weights will be inaccurate and may even increase bias.[6]

| Method | Mechanism | Strengths | Limitations |
|---|---|---|---|
| Inverse propensity weighting | Reweight by inverse selection probability | Retains all data, well-understood theory | Sensitive to propensity model misspecification, high-variance weights |
| Heckman correction | Two-stage model with selection equation | Corrects for unobserved selection factors (under assumptions) | Requires normality assumption and exclusion restriction |
| [SMOTE](/wiki/oversampling) | Synthetic oversampling of minority class | Simple to implement, increases minority representation | Does not address the root cause of selection, may introduce noise |
| Stratified sampling | Proportional or balanced sampling by subgroup | Ensures representation across known groups | Requires knowledge of relevant strata before collection |
| Domain adaptation | Learn domain-invariant representations | Handles complex distributional shifts | Computationally expensive, may lose discriminative information |
| [Data augmentation](/wiki/data_augmentation) | Generate additional training examples | Increases effective sample size for underrepresented groups | Augmented data may not reflect true population variation |

### Heckman correction

The Heckman correction, developed by James Heckman in 1979, is a two-stage statistical method designed specifically for sample selection bias. Heckman framed the problem precisely: the bias from "using non-randomly selected samples to estimate behavioral relationships" can be treated "as an ordinary specification error or omitted variables bias."[18] In the first stage, a probit model estimates the probability of each observation being selected into the sample. From this model, the inverse Mills ratio is computed for each observation. In the second stage, the inverse Mills ratio is included as an additional covariate in the outcome regression, correcting for the bias introduced by non-random selection.[2]

The Heckman correction requires two assumptions: (1) the errors in the selection and outcome equations are jointly normally distributed, and (2) an exclusion restriction exists, meaning at least one variable affects selection but not the outcome. When these assumptions hold, the method provides consistent estimates of the outcome-equation coefficients. When they are violated, or when the exclusion restriction is weak, the correction can inflate standard errors or produce misleading results.
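In symbols, the two stages can be written as follows. The notation ($Z_i$, $\gamma$, $\beta$, $u_i$, $\varepsilon_i$, $\rho$, $\sigma_\varepsilon$) is introduced here for illustration and follows the usual textbook presentation. It assumes that $(u_i, \varepsilon_i)$ are bivariate normal with $\operatorname{Var}(u_i) = 1$, $\operatorname{Var}(\varepsilon_i) = \sigma_\varepsilon^2$, and correlation $\rho$.

$$S_i = \mathbb{1}\left[ Z_i^\top \gamma + u_i > 0 \right], \qquad Y_i = X_i^\top \beta + \varepsilon_i \quad (\text{observed only when } S_i = 1)$$

$$\mathbb{E}\left[ Y_i \mid X_i, Z_i, S_i = 1 \right] = X_i^\top \beta + \rho \, \sigma_\varepsilon \, \lambda(Z_i^\top \gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}$$

Here $\phi$ and $\Phi$ are the standard normal density and distribution function, and $\lambda$ is the inverse Mills ratio. The first stage estimates $\gamma$ with a probit model, and the second stage regresses $Y_i$ on $X_i$ and $\lambda(Z_i^\top \hat{\gamma})$. The coefficient on the added term estimates $\rho \, \sigma_\varepsilon$, so a value near zero is consistent with no selection on unobservables.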

### Cross-validation and robustness checks

[Cross-validation](/wiki/cross-validation) can help surface selection effects, with a caveat: randomly assigned folds are all drawn from the same selected sample, so they share its bias, and ordinary k-fold scores say little about performance on the target population. Large variation in performance across folds may still indicate that subsets of the data have different distributional properties. Grouped schemes such as leave-one-group-out cross-validation (where each fold corresponds to a different subgroup, time period, or data source) are more informative, because they test whether a model generalizes beyond the specific selection patterns in the training data.
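The grouped splitting scheme is simple to express directly. This is a minimal sketch: `leave_one_group_out` is a hypothetical helper written for illustration, and the site labels are made up.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Toy group labels: the data source of each of six rows.
groups = ["site_a", "site_a", "site_b", "site_c", "site_b", "site_c"]

splits = list(leave_one_group_out(groups))
for held_out, train_idx, test_idx in splits:
    print(f"hold out {held_out}: train={train_idx} test={test_idx}")
```

Each split trains on every source except one and evaluates on the source that was left out. A sharp drop in score for a particular held-out group suggests the model depends on patterns specific to the other sources.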

### Adversarial debiasing

Adversarial debiasing uses a two-network architecture in which a primary model learns to make predictions while an adversarial network simultaneously tries to predict sensitive attributes (such as demographic group membership) from the primary model's outputs. The primary model is trained to minimize prediction error while also minimizing the adversary's ability to detect group membership, encouraging the model to learn representations that are less dependent on the biased selection patterns in the training data.

### Causal modeling

When the selection mechanism can be represented as a DAG, causal modeling techniques can identify and correct for selection bias. By specifying which variables influence selection and which influence the outcome, researchers can determine which adjustments are necessary and which would introduce additional bias. Instrumental variable methods, which exploit variables that affect selection but not the outcome, provide another avenue for correcting selection bias when direct adjustment is insufficient.

## How does selection bias differ from other forms of bias?

Selection bias is one of several types of bias that can affect machine learning systems.[13] Understanding how it relates to other biases helps practitioners identify and address the correct problem.

| Bias type | Definition | Cause | Relationship to selection bias |
|---|---|---|---|
| [Selection bias](/wiki/selection_bias) | Non-representative sample | Data collection or filtering process | (this article) |
| [Confirmation bias](/wiki/confirmation_bias) | Tendency to seek evidence supporting existing beliefs | Human judgment in data collection or interpretation | Can cause selection bias when data collectors favor confirming data |
| [Measurement bias](/wiki/measurement_bias) | Systematic error in how variables are recorded | Faulty instruments or inconsistent protocols | Can co-occur with selection bias but involves measurement, not sampling |
| [Label bias](/wiki/label_bias) | Errors or subjectivity in annotation | Annotator subjectivity, unclear guidelines | Distinct from selection bias but can compound its effects |
| [Algorithmic bias](/wiki/algorithmic_fairness) | Model produces systematically unfair outputs | Biased data, model design, or objective function | Often a downstream consequence of selection bias in training data |
| [Reporting bias](/wiki/reporting_bias) | Selective reporting of results | Publication incentives, positive result bias | A form of selection bias applied to research findings rather than data |
| [Inductive bias](/wiki/inductive_bias) | Assumptions built into the learning algorithm | Model architecture and design choices | Unrelated to data selection; concerns model structure |

## How does selection bias affect model fairness and ethics?

Selection bias has direct consequences for the [fairness](/wiki/fairness_metric) of machine learning systems. When certain demographic groups are underrepresented or misrepresented in training data, models trained on that data will produce less accurate predictions for those groups. This can lead to violations of several formal fairness criteria:

- **Demographic parity:** If the selected sample overrepresents one group, the model may assign favorable outcomes disproportionately to that group.
- **Equalized odds:** If model performance (true positive and false positive rates) differs across groups due to selection bias, the system fails to treat groups equally.
- **Individual fairness:** If similar individuals from underrepresented groups receive systematically different predictions due to data gaps, the model violates the principle that similar individuals should be treated similarly.
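The first two criteria can be checked from predictions alone. The sketch below uses invented toy labels and a hypothetical `group_rates` helper, and it assumes binary labels and that every group contains at least one positive and one negative example.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        pos = [p for t, p in rows if t == 1]  # predictions for actual positives
        neg = [p for t, p in rows if t == 0]  # predictions for actual negatives
        stats[g] = {
            "selection_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(pos) / len(pos),
            "fpr": sum(neg) / len(neg),
        }
    return stats

# Toy data: four rows per group, 1 = favorable outcome.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)

# Demographic parity compares selection rates; equalized odds compares TPR and FPR.
parity_gap = abs(rates["A"]["selection_rate"] - rates["B"]["selection_rate"])
odds_gap = max(
    abs(rates["A"]["tpr"] - rates["B"]["tpr"]),
    abs(rates["A"]["fpr"] - rates["B"]["fpr"]),
)
print(rates)
print(f"demographic parity gap: {parity_gap:.2f}, equalized odds gap: {odds_gap:.2f}")
```

Both gaps are 0.5 in this toy case. Gaps of zero would mean the respective criterion holds exactly; how large a gap is acceptable is a policy decision rather than a statistical one.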

Beyond technical fairness metrics, selection bias raises broader ethical concerns. Automated systems that perpetuate historical inequities (as in the COMPAS and healthcare examples above) can entrench discrimination, reduce accountability, and erode public trust in AI systems. Regulatory frameworks like the EU AI Act increasingly require organizations to demonstrate that their AI systems do not exhibit unjustified bias, making the detection and correction of selection bias a legal as well as ethical obligation.

## Best practices for practitioners

1. **Document data provenance.** Record how data was collected, what populations it represents, and what filtering or preprocessing was applied. Datasheets for datasets (proposed by Gebru et al., 2021) and related formats such as data cards provide standardized templates for this documentation.[8]
2. **Audit training data demographics.** Before training, compare the distribution of key variables in the training data against known population statistics to identify gaps.
3. **Evaluate on held-out subgroups.** Test model performance not only on an overall test set but also on subgroups defined by sensitive attributes, geography, time period, and data source.
4. **Use multiple metrics.** A single accuracy number can mask selection bias. Report precision, recall, false positive rates, and false negative rates across subgroups.
5. **Apply reweighting when appropriate.** When the selection mechanism is understood, use propensity scores or importance weights to adjust the training distribution.
6. **Iterate on data collection.** Treat data collection as an ongoing process. Use model error analysis to identify underrepresented regions and collect additional data to fill gaps.
7. **Engage domain experts.** Statisticians, social scientists, and affected communities can identify selection mechanisms that may not be apparent from the data alone.
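
Practices 2 and 5 can be illustrated together with a small sketch. It assumes that selection depends only on one known group variable and that the population shares of that variable are available from an external source such as a census; under those assumptions, importance weighting reduces to post-stratification. The helper names, the `urban`/`rural` groups, and all numbers are hypothetical.

```python
from collections import Counter

def sample_shares(sample_groups):
    """Fraction of the sample that falls in each group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: c / n for g, c in counts.items()}

def representation_gaps(sample_groups, population_shares):
    """Sample share minus population share; positive means overrepresented."""
    shares = sample_shares(sample_groups)
    return {g: shares.get(g, 0.0) - p for g, p in population_shares.items()}

def importance_weights(sample_groups, population_shares):
    """Weight each example by population share / sample share of its group."""
    shares = sample_shares(sample_groups)
    return [population_shares[g] / shares[g] for g in sample_groups]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy sample (hypothetical): urban examples are overrepresented 3 to 1,
# while the assumed target population is split evenly.
sample_groups = ["urban"] * 6 + ["rural"] * 2
labels = [1, 1, 1, 0, 0, 0, 0, 0]
population_shares = {"urban": 0.5, "rural": 0.5}

gaps = representation_gaps(sample_groups, population_shares)
weights = importance_weights(sample_groups, population_shares)
naive_rate = sum(labels) / len(labels)         # ignores the skewed sample
adjusted_rate = weighted_mean(labels, weights)  # reweighted to the population
```

Here the audit reports that `urban` is overrepresented by 25 percentage points, and reweighting moves the estimated positive rate from 0.375 to about 0.25, which matches the rate implied by the assumed population mix. The same weights could be passed as per-example sample weights to a training routine that supports them. When a group is absent from the sample entirely, no weight can recover it, and additional data collection (practice 6) is needed instead.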

## References

1. Berkson, J. (1946). "Limitations of the application of fourfold table analysis to hospital data." *Biometrics Bulletin*, 2(3), 47-53.
2. Heckman, J. J. (1979). "Sample selection bias as a specification error." *Econometrica*, 47(1), 153-161.
3. Buolamwini, J., & Gebru, T. (2018). "Gender Shades: Intersectional accuracy disparities in commercial gender classification." *Proceedings of Machine Learning Research*, 81, 77-91.
4. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). "Dissecting racial bias in an algorithm used to manage the health of populations." *Science*, 366(6464), 447-453.
5. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine bias." *ProPublica*. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
6. Cortes, C., Mohri, M., Riley, M., & Rostamizadeh, A. (2008). "Sample selection bias correction theory." *Proceedings of the 19th International Conference on Algorithmic Learning Theory*, 38-53.
7. Shimodaira, H. (2000). "Improving predictive inference under covariate shift by weighting the log-likelihood function." *Journal of Statistical Planning and Inference*, 90(2), 227-244.
8. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). "Datasheets for datasets." *Communications of the ACM*, 64(12), 86-92.
9. Hovy, D., & Spruit, S. L. (2016). "The social impact of natural language processing." *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics*, 591-598.
10. Zadrozny, B. (2004). "Learning and evaluating classifiers under sample selection bias." *Proceedings of the 21st International Conference on Machine Learning*.
11. Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). "A structural approach to selection bias." *Epidemiology*, 15(5), 615-625.
12. Dastin, J. (2018). "Amazon scraps secret AI recruiting tool that showed bias against women." *Reuters*. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
13. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). "A survey on bias and fairness in machine learning." *ACM Computing Surveys*, 54(6), 1-35.
14. Squire, P. (1988). "Why the 1936 Literary Digest poll failed." *Public Opinion Quarterly*, 52(1), 125-133. See also: "That time the Literary Digest poll got the 1936 election wrong." ProQuest. https://about.proquest.com/en/blog/2016/That-Time-the-Literary-Digest-Poll-Got-1936-Election-Wrong/
15. Mangel, M., & Samaniego, F. J. (1984). "Abraham Wald's work on aircraft survivability." *Journal of the American Statistical Association*, 79(386), 259-267.
16. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). "Dissecting racial bias in an algorithm used to manage the health of populations." *Science*, 366(6464), 447-453. Summary via UC Berkeley: https://news.berkeley.edu/2019/10/24/widely-used-health-care-prediction-algorithm-biased-against-black-people/
17. Larson, J., Mattu, S., Kirchner, L., & Angwin, J. (2016). "How we analyzed the COMPAS recidivism algorithm." *ProPublica*. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
18. Heckman, J. J. (1979). "Sample selection bias as a specification error." *Econometrica*, 47(1), 153-161 (abstract).