Bias in machine learning refers to systematic errors in data, algorithms, or design choices that produce unfair outcomes for certain groups of people. When an AI system consistently favors or disadvantages individuals based on characteristics such as race, gender, age, or socioeconomic status, it exhibits bias in the ethical and fairness sense. Unlike statistical bias (the systematic deviation of an estimator from the quantity it estimates), fairness-related bias is about whether a machine learning system treats people equitably and whether its predictions or decisions reflect unjust patterns from the real world.
As AI systems are increasingly used in high-stakes domains such as criminal justice, hiring, lending, and healthcare, the consequences of biased algorithms have become a central concern in both computer science research and public policy. Addressing bias requires understanding where it originates, how to measure it, and what tools and techniques can reduce it.
A simpler way to see the idea: imagine a robot that helps you pick teams for a game, but the robot learned by watching older kids pick teams, and those older kids always picked boys first. Now the robot thinks boys are better at the game, even though that is not true. The robot has "bias" because it learned an unfair pattern. To fix it, you need to teach the robot with better examples where everyone gets a fair chance, and you need to check the robot's picks to make sure it is being fair to everyone.
Bias can enter a machine learning pipeline at many stages, from data collection through model deployment. Researchers have identified several distinct categories.
Historical bias occurs when training data accurately reflects the real world, but the real world itself contains systemic inequities. For example, if a hiring model is trained on decades of employment records from an industry that historically excluded women, the model may learn to replicate that exclusion, even though the data faithfully represents what happened. Historical bias is particularly difficult to address because the data is not "wrong" in a technical sense; it simply encodes past injustices.
Representation bias arises when certain groups are underrepresented or overrepresented in the training data relative to the population where the model will be deployed. A facial recognition system trained predominantly on images of lighter-skinned individuals will perform poorly on darker-skinned faces. The landmark 2018 Gender Shades study by Joy Buolamwini and Timnit Gebru demonstrated that commercial facial recognition systems from IBM, Microsoft, and Face++ had error rates of up to 34.7% for darker-skinned women, compared to error rates below 1% for lighter-skinned men.
Selection bias occurs when the process used to collect training data systematically excludes certain groups or scenarios. For instance, a medical dataset collected only from urban hospitals may not generalize to rural populations. Selection bias also occurs through survivorship bias, where only successful outcomes are recorded, or through non-response bias, where certain demographics are less likely to participate in data collection.
Measurement bias appears when the features or labels used to train a model are measured differently across groups. In the healthcare domain, Obermeyer et al. (2019) found that a widely used algorithm for identifying patients who need additional care used healthcare spending as a proxy for health needs. Because Black patients historically had less access to healthcare and therefore generated lower costs at the same level of illness, the algorithm systematically underestimated their care needs. At a given risk score, Black patients had on average 26% more chronic conditions than White patients.
Confirmation bias in ML arises when model builders unconsciously process data or design systems in ways that affirm their pre-existing beliefs. A researcher who expects a certain relationship between variables may select features, evaluation metrics, or data subsets that confirm that expectation while ignoring contradictory evidence.
Automation bias is the tendency for humans to over-rely on automated systems, accepting AI outputs without sufficient scrutiny. When a judge places excessive trust in an algorithmic risk score, or a doctor defers uncritically to a diagnostic AI, automation bias can amplify the effects of any underlying algorithmic bias. Studies have shown that automation bias is particularly strong when users lack expertise in the domain or when the AI system presents results with high confidence.
| Bias type | Stage of origin | Description | Example |
|---|---|---|---|
| Historical bias | Data generation (societal context) | Training data accurately encodes past societal discrimination | Hiring models replicating historical gender exclusion |
| Representation bias | Data collection | Certain groups are under- or overrepresented in data | Facial recognition failing on darker-skinned individuals |
| Selection bias | Data sampling | Systematic exclusion of groups during data collection | Medical data from urban hospitals only |
| Measurement bias | Feature/label design | Variables measured differently across groups | Using healthcare spending as a proxy for health needs |
| Confirmation bias | Model development | Developers' beliefs influence design choices | Selecting features that confirm expected relationships |
| Automation bias | Deployment | Users over-rely on AI outputs | Judges deferring uncritically to risk scores |
Defining "fairness" in a mathematical sense has proven to be one of the most challenging aspects of the field. Researchers have proposed dozens of formal fairness criteria, each capturing a different intuition about what it means for an algorithm to be fair. The most widely studied definitions fall into several families.
Demographic parity (also called statistical parity or group fairness) requires that the proportion of positive outcomes be equal across groups defined by a protected attribute. Formally, a classifier satisfies demographic parity if P(Y_hat = 1 | A = 0) = P(Y_hat = 1 | A = 1), where A is the protected attribute. For example, a loan approval system satisfies demographic parity if it approves the same percentage of applications from each racial group. Critics argue that demographic parity can be inappropriate when base rates genuinely differ between groups, as it may require approving less qualified applicants from one group to balance the rates.
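As a concrete check, the gap in positive prediction rates can be computed directly from a model's outputs. The following is a minimal sketch in plain NumPy; the arrays `y_pred` (binary predictions) and `a` (binary protected attribute) are illustrative names, not part of any particular library.

```python
import numpy as np

def demographic_parity_difference(y_pred, a):
    """Absolute gap in positive prediction rates between the two groups."""
    rate_a0 = y_pred[a == 0].mean()   # P(Y_hat = 1 | A = 0)
    rate_a1 = y_pred[a == 1].mean()   # P(Y_hat = 1 | A = 1)
    return abs(rate_a1 - rate_a0)

# Toy example: 60% approval rate for group 0 vs. 40% for group 1 -> gap of 0.20
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0])
a      = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, a))  # 0.2
```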
Equalized odds, introduced by Hardt, Price, and Srebro (2016), requires that the true positive rate and false positive rate be equal across groups. This means the model should be equally accurate for each group, conditional on the true outcome. A recidivism prediction tool satisfies equalized odds if it is equally likely to correctly identify someone who will reoffend (true positive) and equally unlikely to falsely flag someone who will not reoffend (false positive), regardless of race. A relaxed version called equality of opportunity requires only that true positive rates be equal.
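Equalized odds can be checked in a similar way, this time conditioning on the true labels. The sketch below computes per-group true positive and false positive rates from illustrative arrays `y_true`, `y_pred`, and `a`.

```python
import numpy as np

def tpr_fpr(y_true, y_pred, mask):
    """True positive rate and false positive rate within one group."""
    yt, yp = y_true[mask], y_pred[mask]
    tpr = yp[yt == 1].mean()          # P(Y_hat = 1 | Y = 1, group)
    fpr = yp[yt == 0].mean()          # P(Y_hat = 1 | Y = 0, group)
    return tpr, fpr

def equalized_odds_gaps(y_true, y_pred, a):
    tpr0, fpr0 = tpr_fpr(y_true, y_pred, a == 0)
    tpr1, fpr1 = tpr_fpr(y_true, y_pred, a == 1)
    # Both gaps close to zero indicates equalized odds;
    # the TPR gap alone corresponds to equality of opportunity.
    return abs(tpr1 - tpr0), abs(fpr1 - fpr0)
```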
A model is calibrated across groups if, among individuals assigned a given risk score, the actual outcome rate is the same regardless of group membership. For example, if a recidivism tool assigns a risk score of 7 out of 10, the observed reoffense rate among defendants given that score should be the same for Black and White defendants (roughly 70%, if the score is read as a probability). Calibration ensures that the scores mean the same thing for different groups. Northpointe, the creator of the COMPAS tool, argued that their system was calibrated, even as ProPublica demonstrated that error rates differed significantly by race.
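A rough empirical calibration check is to bin the predicted scores and compare observed outcome rates per group within each bin. The sketch below uses pandas; the array names are illustrative.

```python
import pandas as pd

def calibration_by_group(scores, y_true, a, bins=10):
    """Observed outcome rate per (score bin, group). Similar rates across groups
    within each bin indicate calibration across groups."""
    df = pd.DataFrame({"score": scores, "y": y_true, "group": a})
    df["bin"] = pd.cut(df["score"], bins=bins)
    return df.groupby(["bin", "group"], observed=True)["y"].mean().unstack("group")
```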
Individual fairness, proposed by Dwork et al. (2012), requires that similar individuals receive similar outcomes. This is formalized through the idea that a fair algorithm should be a Lipschitz-continuous function: small changes in an individual's features should produce only small changes in the predicted outcome. The challenge lies in defining an appropriate distance metric that captures "similarity" in a way that is both meaningful and does not smuggle in bias through the choice of features.
Counterfactual fairness, introduced by Kusner et al. (2017), uses causal reasoning to ask: would this individual have received the same decision in a counterfactual world where their protected attribute were different? A decision is counterfactually fair if changing a person's race or gender (and all downstream effects of that attribute) would not change the model's prediction. This approach relies on having a causal model of the data-generating process.
| Fairness definition | Core requirement | Strengths | Limitations |
|---|---|---|---|
| Demographic parity | Equal positive prediction rates across groups | Simple to measure and interpret | May conflict with accuracy when base rates differ |
| Equalized odds | Equal TPR and FPR across groups | Conditions on true outcome, preserving accuracy | Requires access to ground truth labels |
| Calibration | Equal outcome rates at each score level | Ensures scores are meaningful across groups | Can coexist with disparate error rates |
| Individual fairness | Similar people get similar outcomes | Focuses on individual treatment rather than group statistics | Requires a meaningful similarity metric |
| Counterfactual fairness | Same decision in counterfactual world | Accounts for causal structure | Requires a causal model, which may be contested |
One of the most important theoretical results in algorithmic fairness is the demonstration that several desirable fairness criteria cannot all be satisfied simultaneously, except in trivial cases.
Kleinberg, Mullainathan, and Raghavan (2016) proved that three natural fairness conditions for risk scores (calibration within groups, balance for the positive class, and balance for the negative class) cannot all hold at the same time unless either the predictor is perfect or the base rates are equal across groups. Independently, Chouldechova (2017) showed that when base rates differ between groups, it is mathematically impossible for a binary classifier to simultaneously achieve equal false positive rates and equal false negative rates while also being well-calibrated.
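The tension behind Chouldechova's result can be reproduced with a few lines of arithmetic. Her analysis rests on the identity FPR = p/(1 − p) · (1 − PPV)/PPV · (1 − FNR), where p is a group's base rate and PPV the positive predictive value: if two groups share the same PPV (the calibration-style condition) and the same false negative rate but differ in base rate, their false positive rates cannot be equal. The rates below are invented purely to illustrate the relationship.

```python
def implied_fpr(base_rate, ppv, fnr):
    """Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and FNR for both groups, but different base rates (hypothetical numbers)
ppv, fnr = 0.7, 0.35
print(implied_fpr(0.50, ppv, fnr))  # group with a 50% base rate -> FPR ~ 0.28
print(implied_fpr(0.30, ppv, fnr))  # group with a 30% base rate -> FPR ~ 0.12
```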
These results have profound practical implications. In the COMPAS debate, ProPublica argued that the tool was unfair because Black defendants faced higher false positive rates (being incorrectly labeled high-risk). Northpointe countered that the tool was fair because it was calibrated (a score of 7 meant approximately the same recidivism probability regardless of race). The impossibility theorem shows that both sides were applying legitimate fairness criteria, but satisfying one necessarily meant violating the other, given unequal base rates of recidivism across racial groups.
The impossibility theorem does not mean that fairness is unachievable. Rather, it means that practitioners must make explicit choices about which fairness criteria to prioritize, and these choices involve value judgments that cannot be resolved through mathematics alone.
Several high-profile cases have brought algorithmic bias to public attention and shaped both research and policy.
In May 2016, ProPublica published an investigation of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system, a risk assessment tool used by courts across the United States to inform bail, sentencing, and parole decisions. ProPublica's analysis of over 7,000 defendants in Broward County, Florida found that Black defendants were nearly twice as likely as White defendants to be falsely flagged as high-risk (false positives), while White defendants were more likely to be incorrectly labeled as low-risk despite going on to reoffend (false negatives). Even after controlling for prior criminal history, age, and gender, Black defendants were 45% more likely to be assigned a higher risk score.
Northpointe (now Equivant) disputed the findings, arguing that its tool was calibrated. The ensuing academic debate directly led to the impossibility theorem results by Kleinberg et al. and Chouldechova, which demonstrated that the fairness criteria favored by each side were mathematically incompatible.
In 2018, Reuters reported that Amazon had developed an AI recruiting tool starting in 2014 to automate resume screening. The system was trained on 10 years of resumes submitted to the company, which were predominantly from men, reflecting the male-dominated tech industry. The tool learned to penalize resumes containing words associated with women, including the word "women's" (as in "women's chess club captain") and the names of certain all-women's colleges. It also favored language patterns more common in resumes submitted by men, such as verbs like "executed" and "captured." Amazon attempted to correct the biases but ultimately disbanded the project in 2017 after losing confidence that the system could be made gender-neutral.
Obermeyer et al. (2019), published in Science, examined a commercial algorithm used by health systems to identify patients who would benefit from high-risk care management programs. The study found that the algorithm assigned lower risk scores to Black patients who were equally as sick as White patients. The root cause was that the algorithm used healthcare spending as a proxy for health needs, and because of systemic inequities in access to care, Black patients generated lower healthcare costs than equally ill White patients. The researchers estimated that fixing the bias would increase the proportion of Black patients identified for additional care from 17.7% to 46.5%. By collaborating with the algorithm's manufacturer, they demonstrated that using health-based labels (such as avoidable costs or number of chronic conditions) rather than spending reduced racial bias by 84%.
The 2018 Gender Shades study by Buolamwini and Gebru tested three commercial facial recognition systems and found significant disparities in accuracy across intersections of gender and skin type. Darker-skinned women had the highest error rates across all three systems. A follow-up study in 2019 showed that after the initial findings were publicized, all three companies improved their systems, though disparities persisted. In 2020, IBM withdrew from the general-purpose facial recognition market, and Amazon imposed a one-year moratorium on police use of its Rekognition system. Several US cities, including San Francisco, Boston, and Minneapolis, have enacted bans or restrictions on government use of facial recognition technology.
| Incident | Year | Domain | Key finding |
|---|---|---|---|
| COMPAS / ProPublica | 2016 | Criminal justice | Black defendants nearly twice as likely to be falsely flagged as high-risk |
| Amazon hiring tool | 2014-2017 | Employment | System penalized resumes with female-associated terms |
| Healthcare algorithm (Obermeyer et al.) | 2019 | Healthcare | Algorithm underestimated Black patients' care needs due to spending proxy |
| Gender Shades (Buolamwini and Gebru) | 2018 | Facial recognition | Error rates up to 34.7% for darker-skinned women vs. under 1% for lighter-skinned men |
Bias in natural language processing (NLP) systems has received substantial attention since the mid-2010s.
Bolukbasi et al. (2016) demonstrated that word embeddings trained on Google News articles encoded pervasive gender stereotypes. Their widely cited finding showed that word vector arithmetic produced analogies such as "man is to computer programmer as woman is to homemaker." The researchers proposed a debiasing method that projects embeddings onto a gender subspace to neutralize stereotypical associations while preserving legitimate gender distinctions (such as queen/king). Their evaluation showed that the proportion of words exhibiting stereotypical gender associations dropped from 19% to 6% after debiasing. However, subsequent work by Gonen and Goldberg (2019) showed that residual bias could still be recovered from debiased embeddings, suggesting that the debiasing was partly superficial.
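A heavily simplified, single-pair sketch of the projection step is shown below. The original method estimates the gender direction with PCA over several definitional word pairs and adds an equalization step, so this only conveys the geometry; the toy vectors are invented, not real embeddings.

```python
import numpy as np

def neutralize(v, direction):
    """Remove the component of vector v that lies along the (unit) bias direction."""
    direction = direction / np.linalg.norm(direction)
    return v - np.dot(v, direction) * direction

# Toy 4-dimensional "embeddings" (invented numbers, not real word vectors)
he = np.array([0.8, 0.1, 0.3, 0.0])
she = np.array([-0.7, 0.1, 0.3, 0.1])
programmer = np.array([0.4, 0.5, 0.2, 0.1])

gender_direction = he - she
debiased = neutralize(programmer, gender_direction)
# After neutralization the projection onto the gender direction is ~0
print(np.dot(debiased, gender_direction / np.linalg.norm(gender_direction)))
```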
Large language models (LLMs) trained on internet text inherit and can amplify societal biases present in their training corpora. A 2023 UNESCO study examining GPT-2, GPT-3.5, and Llama 2 found clear evidence of gender bias: when prompted to generate text about people, the models were three to six times more likely to assign occupations that aligned with gender stereotypes. Llama 2 generated explicitly sexist or misogynistic content in approximately 20% of prompt completions about women. AI-generated content showed a pattern of assigning diverse, professional roles to men (teacher, doctor, engineer) while assigning stereotypical or undervalued roles to women.
Research published in PNAS in 2025 found that even value-aligned language models that pass explicit social bias tests still harbor implicit biases. These models portray socially subordinate groups as more homogeneous, a pattern consistent with well-documented human cognitive biases. Fine-tuned models such as GPT-4o show reduced bias compared to base pretrained models, but the problem remains far from solved.
Protected attributes are personal characteristics shielded from discrimination by law, such as race, gender, age, religion, disability status, and national origin. In many jurisdictions, it is illegal to use these attributes directly in consequential decisions like hiring, lending, and housing.
However, simply removing protected attributes from a model's inputs (a practice sometimes called "fairness through unawareness") is generally insufficient. Other variables in the dataset often serve as proxy variables that are correlated with the protected attribute, allowing the model to indirectly reconstruct the sensitive information. ZIP code is a well-known proxy for race in the United States due to residential segregation patterns. Similarly, first name can serve as a proxy for both race and gender. In lending, the type of bank account or neighborhood characteristics may proxy for race or socioeconomic status.
The challenge of proxy variables is one reason why algorithmic fairness requires active intervention rather than simple omission. Researchers at Carnegie Mellon University have developed methods to systematically identify proxy attributes by searching for variables that are both correlated with a protected feature and influential in the model's decisions. Regulatory bodies such as the US Consumer Financial Protection Bureau (CFPB) use proxy methods (such as Bayesian Improved Surname Geocoding) to impute race and ethnicity for fair lending analyses when actual demographic data is unavailable.
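One simple probe for proxies, in the spirit of the methods described above, is to test how well the protected attribute can be predicted from the remaining features: performance well above chance means the feature set still leaks group membership. The sketch below uses scikit-learn and assumes a feature matrix `X` (with the protected attribute removed) and a binary group indicator `a`; these names are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def proxy_leakage_auc(X, a, random_state=0):
    """AUC of predicting the protected attribute from the other features.
    An AUC far above 0.5 indicates that proxies for the attribute remain in X."""
    X_tr, X_te, a_tr, a_te = train_test_split(X, a, test_size=0.3,
                                              random_state=random_state)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, a_tr)
    return roc_auc_score(a_te, clf.predict_proba(X_te)[:, 1])
```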
Researchers have developed a variety of techniques for reducing bias in machine learning systems. These are commonly organized by the stage of the ML pipeline at which they intervene.
Pre-processing methods modify the training data before the model is trained, with the goal of removing or reducing bias in the data itself; a minimal sketch of one such technique, reweighing, follows the table below.
| Technique | Description |
|---|---|
| Reweighing | Assigns weights to training examples (by group and label) so that the protected attribute becomes statistically independent of the label, without modifying feature values or labels |
| Disparate impact remover | Transforms feature values to improve group fairness while preserving rank-ordering within groups |
| Learning fair representations | Maps data into a new representation that encodes information useful for prediction but removes information about group membership |
| Oversampling / undersampling | Adjusts the representation of different groups in the training set to correct for imbalance |
| Synthetic data generation | Creates additional training examples for underrepresented groups using techniques like SMOTE |
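The sketch below illustrates the reweighing technique from the table: each training example receives the weight P(A = a) · P(Y = y) / P(A = a, Y = y), which makes the protected attribute statistically independent of the label under the weighted distribution (following Kamiran and Calders). Array names are illustrative.

```python
import numpy as np

def reweighing_weights(a, y):
    """Instance weights that make the protected attribute independent of the label:
    w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    a, y = np.asarray(a), np.asarray(y)
    weights = np.empty(len(y), dtype=float)
    for av in np.unique(a):
        for yv in np.unique(y):
            mask = (a == av) & (y == yv)
            p_joint = mask.mean()
            if p_joint > 0:
                weights[mask] = (a == av).mean() * (y == yv).mean() / p_joint
    return weights   # pass as sample_weight when fitting the downstream model
```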
In-processing methods modify the learning algorithm itself to incorporate fairness constraints during training; a toy sketch of the regularization approach follows the table below.
| Technique | Description |
|---|---|
| Adversarial debiasing | Trains the model simultaneously with an adversary that tries to predict the protected attribute from the model's predictions; the primary model learns to make predictions that the adversary cannot exploit |
| Fairness constraints | Adds mathematical fairness constraints directly to the optimization objective |
| Regularization | Adds a penalty term to the loss function that penalizes unfair predictions |
| Meta-fair classifier | Uses a meta-learning approach to find classifiers that optimize for a specific fairness metric |
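The toy sketch referenced above implements the regularization approach: a logistic regression trained by gradient descent whose loss adds a penalty on the squared gap in mean predicted scores between groups (a demographic-parity-style term). It assumes NumPy arrays `X` (features), `y` (0/1 labels), and `a` (0/1 group indicator), and is meant only for illustration, not as a production method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fair_logreg(X, y, a, lam=2.0, lr=0.1, epochs=2000):
    """Logistic regression with a demographic-parity penalty on mean scores.
    Loss = logistic loss + lam * (mean score | A=1  -  mean score | A=0)^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        # gradient of the average logistic loss
        gw = X.T @ (p - y) / n
        gb = np.mean(p - y)
        # gradient of the fairness penalty
        gap = p[a == 1].mean() - p[a == 0].mean()
        dp = p * (1 - p)                      # derivative of the sigmoid
        dgap_w = (X[a == 1] * dp[a == 1, None]).mean(axis=0) \
               - (X[a == 0] * dp[a == 0, None]).mean(axis=0)
        dgap_b = dp[a == 1].mean() - dp[a == 0].mean()
        w -= lr * (gw + lam * 2 * gap * dgap_w)
        b -= lr * (gb + lam * 2 * gap * dgap_b)
    return w, b
```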
Post-processing methods adjust the model's outputs after training to improve fairness, without modifying the model itself; a sketch of group-specific threshold adjustment follows the table below.
| Technique | Description |
|---|---|
| Threshold adjustment | Sets different decision thresholds for different groups to equalize error rates or positive prediction rates |
| Equalized odds post-processing | Adjusts predictions to satisfy equalized odds using a linear program (Hardt et al., 2016) |
| Reject option classification | Gives favorable outcomes to unprivileged groups and unfavorable outcomes to privileged groups in cases where the model is uncertain |
| Calibration adjustment | Recalibrates predicted probabilities to ensure equal calibration across groups |
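The sketch below shows the simplest post-processing intervention from the table, group-specific thresholds: a separate cutoff is chosen for each group so that the groups' selection rates match a common target (the same idea extends to equalizing true positive rates when labels are available). Array names are illustrative.

```python
import numpy as np

def group_thresholds_for_equal_selection(scores, a, target_rate):
    """Per-group score cutoffs so each group's selection rate equals target_rate."""
    thresholds = {}
    for group in np.unique(a):
        group_scores = scores[a == group]
        # the (1 - target_rate) quantile selects roughly target_rate of the group
        thresholds[group] = np.quantile(group_scores, 1 - target_rate)
    return thresholds

def apply_group_thresholds(scores, a, thresholds):
    return np.array([int(s >= thresholds[g]) for s, g in zip(scores, a)])
```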
Several open-source toolkits have been developed to help practitioners assess and mitigate bias in their ML systems.
| Tool | Developer | Key features | Language |
|---|---|---|---|
| AI Fairness 360 (AIF360) | IBM Research | 70+ fairness metrics, 11 bias mitigation algorithms, support for pre-processing, in-processing, and post-processing | Python, R |
| Fairlearn | Microsoft (now community-driven) | Fairness assessment dashboard, mitigation algorithms including exponentiated gradient and grid search, integration with Azure ML | Python |
| What-If Tool | Google | Interactive visual interface for exploring model behavior, counterfactual analysis, threshold optimization, no coding required for basic exploration | Python (TensorBoard/Jupyter) |
| Aequitas | University of Chicago | Audit toolkit focused on bias and fairness in decision-making systems, generates bias reports | Python |
| Themis-ML | MIT Media Lab | Testing and mitigating discrimination in ML models | Python |
IBM's AI Fairness 360 is one of the most comprehensive toolkits, offering algorithms such as optimized preprocessing, reweighing, adversarial debiasing, reject option classification, disparate impact remover, and equalized odds post-processing. It was transferred to the Linux Foundation AI in July 2020. Microsoft's Fairlearn started in 2018 as an internal project and has since grown into a community-driven open-source project with active development and documentation.
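As a brief example of what these toolkits look like in practice, the snippet below uses Fairlearn's MetricFrame to break metrics down by group. It assumes a recent version of the fairlearn package, whose API may change; the toy data exists only so the example runs.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, selection_rate, true_positive_rate

# Toy data purely for illustration
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)
sex = rng.choice(["female", "male"], size=200)

mf = MetricFrame(
    metrics={"selection_rate": selection_rate, "tpr": true_positive_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)
print(mf.by_group)        # one row of metrics per group
print(mf.difference())    # largest between-group gap for each metric
```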
Algorithmic auditing is the systematic evaluation of an AI system's inputs, processing, and outputs to identify and assess bias, discrimination, or other harms. Auditing has emerged as a key governance mechanism for ensuring algorithmic accountability.
External audits involve independent third parties examining a system, while internal audits are conducted by the organization that developed or deployed the system. Researchers have also developed methods for "black-box" auditing, where the auditor does not have access to the model's internals and instead probes the system by submitting inputs and observing outputs.
Audit methodologies draw on techniques from social science research, particularly correspondence studies (such as resume audit studies used to detect hiring discrimination). In AI auditing, controlled demographic pairings, where identical inputs are submitted with only the protected attribute changed, can reveal whether the system treats different groups differently.
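A black-box version of such paired testing can be sketched in a few lines: submit each input twice, differing only in the protected attribute (or a proxy such as a name), and measure how often the decision flips. The `model.predict` call and the DataFrame column are placeholders for whatever system is being audited.

```python
import pandas as pd

def paired_audit(model, records: pd.DataFrame, attribute: str, value_a, value_b):
    """Fraction of records whose decision changes when only `attribute` is swapped."""
    variant_a = records.copy()
    variant_b = records.copy()
    variant_a[attribute] = value_a
    variant_b[attribute] = value_b
    preds_a = model.predict(variant_a)   # placeholder: any callable scoring system
    preds_b = model.predict(variant_b)
    return (preds_a != preds_b).mean()
```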
Regulatory mandates for algorithmic auditing are increasing. New York City's Local Law 144, which took effect in July 2023, requires employers and employment agencies using automated employment decision tools (AEDTs) to commission annual independent bias audits and publish the results. The audits must report selection rates and impact ratios for sex and race/ethnicity categories.
The EU AI Act requires providers of high-risk AI systems to conduct conformity assessments, maintain technical documentation, and implement risk management systems. These requirements effectively mandate ongoing internal auditing of high-risk systems, with particular attention to bias in data and model outputs.
Governments around the world are developing legal frameworks to address algorithmic bias.
The European Union's AI Act, which entered into force in August 2024, is the world's first comprehensive legal framework for AI regulation. It takes a risk-based approach, with specific requirements for high-risk AI systems used in domains such as employment, education, law enforcement, and access to essential services. Article 10 mandates that training, validation, and testing datasets for high-risk systems be "relevant, representative, free of errors and complete," with specific consideration given to potential biases. The Act also allows the processing of sensitive personal data (such as race or gender) strictly for the purpose of bias monitoring, detection, and correction, provided that appropriate safeguards are in place. Penalties for non-compliance can reach up to 35 million euros or 7% of global annual turnover.
The US National Institute of Standards and Technology (NIST) released the AI Risk Management Framework (AI RMF 1.0) in January 2023. The framework identifies fairness as a core characteristic of trustworthy AI and outlines three categories of AI bias that organizations should manage: systemic bias (arising from institutions and societal structures), computational and statistical bias (arising from data and algorithms), and human-cognitive bias (arising from how people perceive and process information). The framework is organized around four functions: Govern, Map, Measure, and Manage. While voluntary, the NIST AI RMF has become a widely adopted reference for AI governance in the United States.
New York City's Local Law 144 (2023) specifically targets bias in automated hiring tools. Colorado's AI Act (2024) requires developers and deployers of high-risk AI systems to use reasonable care to protect consumers from algorithmic discrimination. The White House Blueprint for an AI Bill of Rights (2022) articulated five principles for the design, use, and deployment of AI systems, including protection against algorithmic discrimination. In Canada, the proposed Artificial Intelligence and Data Act (AIDA) would require operators of high-impact AI systems to assess and mitigate bias.
Concerns about bias in automated decision-making predate modern machine learning. In the 1970s and 1980s, researchers in statistics and law studied discrimination in credit scoring, insurance, and hiring systems. The Equal Credit Opportunity Act of 1974 in the United States prohibited discrimination in lending on the basis of race, sex, and other protected characteristics, establishing a legal framework that now applies to algorithmic lending systems.
The field of algorithmic fairness gained significant momentum in the mid-2010s. The Fairness, Accountability, and Transparency (FAccT) community, originally called FAT/ML when it began as a workshop in 2014, grew into a major academic conference. The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) has published hundreds of papers on fairness metrics, bias detection, mitigation techniques, and the social implications of algorithmic decision-making.
Arvind Narayanan's 2018 tutorial "21 Fairness Definitions and Their Politics" cataloged the proliferation of fairness definitions and argued that the choice of fairness metric is inherently a political and moral decision, not a purely technical one. This perspective has become widely accepted in the field.