The incompatibility of fairness metrics (also called the impossibility theorem of fairness or fairness trade-offs) refers to a set of mathematical results proving that several widely used definitions of algorithmic fairness cannot be satisfied simultaneously, except under highly restricted conditions. These theorems, established independently by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), show that when the base rates of an outcome differ across demographic groups, no classifier can achieve calibration, equal false positive rates, and equal false negative rates at the same time. The results have had a lasting effect on machine learning research, policy design, and public debate about the role of algorithms in high-stakes decision-making.
As artificial intelligence systems are increasingly used to make or assist decisions in areas such as criminal justice, lending, hiring, and healthcare, concerns about unfair outcomes have grown. Different stakeholders often hold different intuitions about what "fairness" means, and researchers have formalized these intuitions into quantitative metrics. By the mid-2010s, dozens of distinct fairness criteria had been proposed in the academic literature. Arvind Narayanan catalogued at least 21 such definitions in a 2018 tutorial at the ACM Conference on Fairness, Accountability, and Transparency (FAccT), illustrating the breadth and fragmentation of the field.
The problem is not just that there are many definitions. The impossibility theorems demonstrate that certain combinations of these definitions are mathematically contradictory under realistic conditions, forcing practitioners to make explicit choices about which aspects of fairness to prioritize.
Before examining the impossibility results, it is helpful to define the main fairness metrics involved. Consider a binary classification setting with a protected attribute (such as race or gender) that divides the population into groups, a true label Y (the actual outcome), and a predicted label or score R produced by the classifier.
| Metric | Also known as | Definition | Formal condition |
|---|---|---|---|
| Calibration | Predictive parity (when applied to binary predictions) | Among individuals assigned a given risk score, the fraction who actually have the positive outcome should be equal across groups | P(Y = 1 \| R = r, G = a) = P(Y = 1 \| R = r, G = b) for all scores r |
| Demographic parity | Statistical parity, independence | The proportion of individuals receiving a positive prediction should be the same across groups, regardless of the true outcome | P(R = 1 \| G = a) = P(R = 1 \| G = b) |
| Equalized odds | Separation, error rate balance | Both the true positive rate and false positive rate should be equal across groups | P(R = 1 \| Y = y, G = a) = P(R = 1 \| Y = y, G = b) for y in {0, 1} |
| Equal opportunity | — | The true positive rate should be equal across groups (a relaxation of equalized odds) | P(R = 1 \| Y = 1, G = a) = P(R = 1 \| Y = 1, G = b) |
| Balance for the positive class | — | Among individuals who truly belong to the positive class, the average score assigned should be the same across groups | E[R \| Y = 1, G = a] = E[R \| Y = 1, G = b] |
| Balance for the negative class | — | Among individuals who truly belong to the negative class, the average score assigned should be the same across groups | E[R \| Y = 0, G = a] = E[R \| Y = 0, G = b] |
Each of these metrics captures a distinct and intuitively reasonable aspect of fairness. Calibration ensures that scores mean the same thing for everyone; equalized odds ensures that errors are distributed evenly; demographic parity ensures that outcomes are distributed evenly. The impossibility theorems reveal that these goals are in direct mathematical tension.
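To make these definitions concrete, the following sketch computes the per-group quantities that the metrics above compare. The function name, argument names, and NumPy-array setup are illustrative choices for this example, not tied to any particular library's API.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group base rate, PPV, FPR, and FNR for a binary classifier.

    y_true, y_pred : arrays of 0/1 true outcomes and predictions
    group          : array of group labels (e.g., "a", "b")
    """
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tp = np.sum((yp == 1) & (yt == 1))
        fp = np.sum((yp == 1) & (yt == 0))
        fn = np.sum((yp == 0) & (yt == 1))
        tn = np.sum((yp == 0) & (yt == 0))
        rates[g] = {
            "base_rate": yt.mean(),                              # P(Y = 1 | G = g)
            "ppv": tp / (tp + fp) if tp + fp else float("nan"),  # P(Y = 1 | R = 1, G = g)
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),  # P(R = 1 | Y = 0, G = g)
            "fnr": fn / (fn + tp) if fn + tp else float("nan"),  # P(R = 0 | Y = 1, G = g)
        }
    return rates
```

Comparing these dictionaries across groups shows directly which parity conditions hold: equal `ppv` values correspond to predictive parity, while equal `fpr` and `fnr` values correspond to error rate balance.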
Alexandra Chouldechova's 2017 paper, "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," published in the journal Big Data (Volume 5, Issue 2, pp. 153-163), provided a concise algebraic proof of the incompatibility between predictive parity and error rate balance.
Chouldechova showed that a binary classifier cannot simultaneously satisfy the following three conditions across two groups:

- Predictive parity: equal positive predictive value (PPV) across groups
- Equal false positive rates (FPR) across groups
- Equal false negative rates (FNR) across groups

The only exceptions are two special cases: (a) the classifier is perfect (makes no errors), or (b) the base rates of the positive outcome (its prevalence) are equal across groups.
The proof relies on a straightforward algebraic identity connecting the PPV, FPR, FNR, and base rate p of each group:
PPV / (1 - PPV) = [p / (1 - p)] x [(1 - FNR) / FPR]
This equation shows that the PPV is determined by three quantities: the base rate, the false negative rate, and the false positive rate. If the base rates differ between two groups (p_a is not equal to p_b), then equalizing the PPV across groups requires that the ratio (1 - FNR) / FPR also differ between groups. But if both the FNR and FPR are individually equalized across groups, then the ratio is the same for both groups, creating a contradiction. Therefore, predictive parity and error rate balance cannot hold at the same time when base rates differ.
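A small numeric check makes the tension visible. The base rates and error rates below are hypothetical, chosen only for illustration: holding the FPR and FNR fixed across two groups with different prevalences forces their PPVs apart.

```python
def ppv_from_rates(p, fpr, fnr):
    """PPV implied by base rate p, false positive rate, and false negative rate,
    via PPV / (1 - PPV) = [p / (1 - p)] * [(1 - FNR) / FPR]."""
    odds = (p / (1 - p)) * ((1 - fnr) / fpr)
    return odds / (1 + odds)

# Two hypothetical groups with identical error rates but different base rates
fpr, fnr = 0.20, 0.30
print(f"{ppv_from_rates(0.50, fpr, fnr):.3f}")  # 0.778 at a 50% base rate
print(f"{ppv_from_rates(0.30, fpr, fnr):.3f}")  # 0.600 at a 30% base rate
```

Equalizing the error rates here leaves the PPVs unequal (0.778 versus 0.600); conversely, forcing the PPVs to match would require the error rates to diverge.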
This result is significant because it applies to any classifier, regardless of the algorithm used. It is not a limitation of a particular model or a flaw in training data; it is a structural mathematical constraint. Any system that claims to be "fair" by multiple metrics simultaneously must either be perfectly accurate or operate in a setting where base rates happen to be equal.
Independently of Chouldechova, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a closely related but distinct impossibility result in their 2016 paper, "Inherent Trade-Offs in the Fair Determination of Risk Scores," published at the 8th Innovations in Theoretical Computer Science (ITCS) Conference in 2017.
Kleinberg et al. formalized three conditions for a risk score assigned to individuals:

- Calibration within groups: among individuals in any group who receive a given score, the fraction with the positive outcome should match that score, so that a score carries the same meaning in every group
- Balance for the positive class: the average score assigned to individuals who truly have the positive outcome should be the same across groups
- Balance for the negative class: the average score assigned to individuals who truly have the negative outcome should be the same across groups
The authors proved that, except in highly constrained special cases, no risk scoring method can satisfy all three conditions simultaneously. The two special cases are:

- Perfect prediction: the score assigns each individual a value of exactly 0 or 1 that matches their true outcome
- Equal base rates: the proportion of individuals with the positive outcome is the same in every group
Furthermore, Kleinberg et al. showed that even approximately satisfying all three conditions requires the data to closely resemble one of these two special cases. This strengthens the result by ruling out the hope that one might come "close enough" to satisfying all three conditions through careful algorithm design.
While the two theorems were proved independently and use different formalisms, they address the same fundamental tension. Chouldechova's result focuses on binary predictions and uses the language of classification error rates and predictive values. Kleinberg et al. work with continuous risk scores and frame their conditions in terms of calibration and score balance. Both arrive at the same conclusion: calibration-type fairness and error-balance-type fairness are incompatible when base rates differ.
| Feature | Chouldechova (2017) | Kleinberg, Mullainathan, Raghavan (2016) |
|---|---|---|
| Publication | Big Data, Vol. 5, No. 2 | ITCS 2017 |
| Prediction type | Binary classification | Continuous risk scores |
| Incompatible conditions | PPV parity + FPR parity + FNR parity | Calibration + balance for positive class + balance for negative class |
| Special cases allowing compatibility | Perfect classifier or equal base rates | Perfect prediction or equal base rates |
| Proof technique | Algebraic identity linking PPV, FPR, FNR, and base rate | Structural argument from properties of conditional expectations |
Several other researchers have extended or complemented these foundational theorems.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger presented "On Fairness and Calibration" at NeurIPS 2017. They showed that calibration is compatible with at most one additional error constraint (for example, equal false negative rates across groups), and that any algorithm achieving this relaxed combination is equivalent to randomizing a fraction of predictions from an existing classifier. They also introduced the concept of "equal cost," where false positive rates and false negative rates are allowed to compensate each other according to a cost function, and showed that the incompatibility persists under this relaxation.
Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian approached the problem from a philosophical angle. Their paper, later expanded and published in Communications of the ACM in 2021, argued that different fairness definitions correspond to different assumptions about how the world works. They introduced the concept of a "construct space" (representing true, unobservable qualities of individuals) versus an "observed space" (representing measured features). They showed that two internally consistent but mutually incompatible worldviews underlie different fairness criteria: one worldview assumes that observed differences between groups are largely artifacts of biased measurement processes, while the other assumes that observed data generally reflects genuine differences. This framing explains why disagreements about algorithmic fairness often run deeper than technical disputes.
Sam Corbett-Davies and Sharad Goel provided an extensive survey in "The Measure and Mismeasure of Fairness," later published in the Journal of Machine Learning Research in 2023. They categorized fairness definitions into two families: those that constrain the effects of decisions on disparities and those that constrain the influence of protected characteristics on decisions. Their analysis showed that requiring certain fairness definitions to hold can, perversely, harm the very groups they were intended to protect by leading to strongly Pareto-dominated decision policies.
An earlier and influential work by Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel, "Fairness Through Awareness" (2012), proposed individual fairness as an alternative to group-based metrics. Individual fairness requires that similar individuals be treated similarly, where similarity is defined by a task-specific distance metric. While this approach sidesteps some group-level incompatibilities, it introduces the challenge of defining the similarity metric itself, which can be subjective and context-dependent.
The most widely discussed real-world example of the fairness metric incompatibility is the controversy surrounding COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a recidivism risk assessment tool developed by Northpointe (now Equivant). The dispute between ProPublica and Northpointe became a defining moment in the public understanding of algorithmic fairness.
COMPAS was designed by statistician Tim Brennan and corrections professional Dave Wells. The system uses 137 survey questions combined with criminal history data to generate risk scores on a scale of 1 to 10, predicting the likelihood that a defendant will commit a new crime within two years of release. Although race is not used as an explicit input, many of the input variables (such as neighborhood characteristics, employment history, and prior arrests) correlate with racial demographics.
In May 2016, journalists Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published "Machine Bias" in ProPublica, analyzing COMPAS predictions for approximately 10,000 criminal defendants in Broward County, Florida. They compared the algorithm's predictions against actual recidivism outcomes over a two-year follow-up period.
Their key findings on error rates by race:
| Metric | Black defendants | White defendants |
|---|---|---|
| False positive rate (labeled high-risk but did not reoffend) | 44.8% | 23.5% |
| False negative rate (labeled low-risk but did reoffend) | 28.0% | 47.7% |
| Overall accuracy | 63.8% | 67.0% |
ProPublica concluded that the algorithm was biased against Black defendants: Black defendants who did not reoffend were nearly twice as likely as White defendants to be incorrectly labeled high-risk, while White defendants who did reoffend were substantially more likely than Black defendants to be incorrectly labeled low-risk.
Northpointe, represented by William Dieterich, Christina Mendoza, and Tim Brennan, rejected the conclusion that COMPAS was racially biased. Their defense centered on calibration: they argued that among defendants who received the same COMPAS score, Black and White defendants had nearly identical actual recidivism rates. For example, among defendants who scored a seven, approximately 60% of White defendants and 61% of Black defendants went on to reoffend. By this standard, the scores "meant the same thing" regardless of race.
Northpointe reported AUC (Area Under the Curve) scores of approximately 0.63 to 0.65, with comparable performance across racial groups, and argued that the error rate disparities ProPublica identified were a natural mathematical consequence of different base rates of recidivism between groups, not evidence of bias.
The COMPAS debate was eventually recognized as a concrete illustration of the impossibility theorems. The recidivism base rates differed significantly between groups: approximately 51.4% for Black defendants and 39.4% for White defendants in the Broward County data. Given this difference in base rates, the theorems by Chouldechova and by Kleinberg et al. prove that no algorithm, regardless of how well designed, can simultaneously achieve:

- Calibration (equal predictive values across groups)
- Equal false positive rates across groups
- Equal false negative rates across groups
Both ProPublica and Northpointe were measuring real properties of the system, but they were applying different fairness standards that are mathematically impossible to satisfy at the same time. The dispute was not a matter of one side being wrong; it reflected a genuine, unavoidable trade-off.
The base rate (or prevalence) of the positive outcome in each group is the key quantity driving the impossibility results. When base rates are equal across groups, all of the standard fairness metrics can, in principle, be satisfied simultaneously. But when base rates differ, the mathematical constraints bind.
Base rates often differ across demographic groups for complex social, historical, and structural reasons. In criminal justice, for example, differences in arrest rates across racial groups reflect not only differences in criminal behavior but also differences in policing practices, enforcement priorities, neighborhood surveillance, and historical discrimination. These upstream factors create base rate differences that then trigger the impossibility constraints downstream in any predictive system.
This observation highlights a broader point: algorithmic fairness problems are often symptoms of societal inequalities rather than purely technical defects. No amount of algorithmic tuning can resolve base rate differences caused by structural factors outside the algorithm's control.
The impossibility theorems do not mean that fairness is unachievable or that practitioners should abandon the effort. Rather, they mean that fairness requires explicit choices about which criteria matter most in a given context.
Different application domains may call for different fairness priorities:
| Domain | Typically prioritized metric | Rationale |
|---|---|---|
| Criminal justice (pretrial risk) | Calibration | Scores should accurately reflect risk regardless of group membership; poorly calibrated scores could lead to unjust detention or release decisions |
| Lending and credit | Equalized odds or equal opportunity | Qualified applicants from all groups should have the same chance of receiving credit; unequal error rates may violate fair lending laws |
| Hiring | Demographic parity or equal opportunity | Ensuring proportional representation may be a legal or organizational goal |
| Healthcare | Calibration | Clinical risk scores must reflect true patient risk to avoid misallocation of medical resources |
| Child welfare screening | Balance of false positive and false negative rates | Both wrongful family separations and missed cases of genuine risk carry severe consequences |
There is no universal answer. The choice of which metric to prioritize depends on the specific harms at stake, the legal and regulatory environment, and the values of the affected communities.
Practitioners have developed three broad categories of techniques for managing fairness constraints, even when perfect satisfaction of all metrics is impossible.
Pre-processing methods modify the training data before a model is built. Examples include resampling to balance group representation, removing or transforming features correlated with the protected attribute, and relabeling instances to reduce historical bias in the training labels.
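One concrete pre-processing example is reweighing in the style of Kamiran and Calders, which assigns each training instance a weight so that group membership and label are statistically independent in the weighted data. The sketch below assumes a pandas DataFrame with `group` and `label` columns; the column and function names are illustrative, not a specific library's API.

```python
import pandas as pd

def reweigh(df, group_col="group", label_col="label"):
    """Weight each row by P(group) * P(label) / P(group, label) so that the
    weighted data shows no association between group membership and label."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n  # P(group, label)
    weights = df.apply(
        lambda row: p_group[row[group_col]] * p_label[row[label_col]]
        / p_joint[(row[group_col], row[label_col])],
        axis=1,
    )
    return df.assign(weight=weights)
```

The resulting `weight` column can then be passed to any learner that accepts sample weights.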
In-processing methods modify the learning algorithm itself. These approaches incorporate fairness constraints directly into the loss function or optimization procedure, so that the model is trained to balance accuracy against one or more fairness objectives. Regularization terms penalizing fairness violations are a common technique.
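As a minimal sketch of the in-processing idea (a generic illustration, not a specific published algorithm), the following trains a logistic regression by gradient descent while penalizing the squared gap in mean predicted probability between two groups, a soft demographic-parity regularizer. The assumption that the two groups are encoded as 0 and 1, and the `lam` weight, are choices of this example.

```python
import numpy as np

def train_fair_logreg(X, y, group, lam=1.0, lr=0.1, epochs=500):
    """Logistic regression whose loss adds lam * (demographic parity gap)**2."""
    w = np.zeros(X.shape[1])
    a, b = (group == 0), (group == 1)          # boolean masks for the two groups
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
        grad_loss = X.T @ (p - y) / len(y)     # gradient of the log loss
        gap = p[a].mean() - p[b].mean()        # demographic parity gap
        s = p * (1 - p)                        # derivative of the sigmoid
        grad_gap = (X[a] * s[a][:, None]).mean(axis=0) - (X[b] * s[b][:, None]).mean(axis=0)
        w -= lr * (grad_loss + lam * 2 * gap * grad_gap)
    return w
```

Raising `lam` trades accuracy for a smaller gap between groups; setting it to zero recovers ordinary logistic regression.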
Post-processing methods adjust the outputs of a trained model. For example, group-specific classification thresholds can be set so that the false positive rate or false negative rate is equalized across groups. Hardt, Price, and Srebro (2016) showed in their paper "Equality of Opportunity in Supervised Learning" that post-processing can optimally adjust any learned predictor to satisfy equalized odds or equal opportunity constraints, though this typically comes at some cost to overall accuracy.
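A minimal sketch of group-specific thresholding follows; it illustrates the general idea rather than the randomized-threshold construction of Hardt et al. For each group, it chooses the score cutoff whose false positive rate is closest to a shared target, which approximately equalizes FPR across groups.

```python
import numpy as np

def thresholds_for_fpr(scores, y_true, group, target_fpr=0.2):
    """Pick a per-group threshold t so that P(score >= t | Y = 0, G = g) is
    approximately target_fpr in every group."""
    chosen = {}
    for g in np.unique(group):
        negatives = scores[(group == g) & (y_true == 0)]
        # The (1 - target_fpr) quantile of negative scores leaves roughly
        # target_fpr of that group's negatives above the threshold.
        chosen[g] = np.quantile(negatives, 1 - target_fpr)
    return chosen
```

Applying `scores >= chosen[g]` within each group then yields roughly equal false positive rates, at the cost of the individually inconsistent decisions noted in the next paragraph.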
Each approach involves trade-offs. Pre-processing may discard useful information. In-processing may slow training or reduce model expressiveness. Post-processing may produce individually inconsistent decisions (two individuals with the same features may receive different predictions if they belong to different groups).
Recent research has explored ways to work within the constraints of the impossibility theorems rather than against them. Hebert-Johnson, Kim, Reingold, and Rothblum (2018) introduced multicalibration, which requires that a predictor be well-calibrated not just overall but simultaneously for every computationally identifiable subgroup. This framework does not violate the impossibility theorems (it does not attempt to equalize error rates), but it offers strong guarantees that predictions are meaningful for a wide range of overlapping subgroups, including intersections of protected attributes.
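As a rough illustration of what multicalibration asks for (a simplified audit, not the full algorithm from Hebert-Johnson et al.), one can check whether predictions are calibrated within every specified subgroup and score bucket:

```python
import numpy as np

def calibration_violations(scores, y_true, subgroups, n_bins=10, alpha=0.05):
    """List (subgroup, bin) cells where the mean outcome differs from the mean
    predicted score by more than alpha. `subgroups` maps names to boolean masks."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    violations = []
    for name, mask in subgroups.items():
        for b in range(n_bins):
            cell = mask & (bins == b)
            if not cell.any():
                continue
            gap = abs(y_true[cell].mean() - scores[cell].mean())
            if gap > alpha:
                violations.append((name, b, float(gap)))
    return violations
```

A multicalibrated predictor would return an empty list even for a rich, overlapping collection of subgroups, including intersections of protected attributes.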
Other researchers have explored approximate relaxations of the impossibility results. A 2023 paper presented at ACM FAccT, "The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice," showed that if practitioners accept small margins of error between metrics, large sets of models can simultaneously satisfy false negative rate parity, false positive rate parity, and positive predictive value parity, even when base rates differ moderately. This suggests that the impossibility theorems, while mathematically precise, may be less binding in practice than their worst-case formulations suggest.
The impossibility of simultaneously satisfying multiple fairness criteria has implications for regulation and law. When a company or government agency claims that its algorithm is "fair," the claim is underdetermined without specifying which metric of fairness is being used.
The EU AI Act, which took effect in 2024, classifies AI systems used in criminal justice, credit scoring, hiring, and other sensitive areas as "high-risk" and imposes requirements for bias testing and transparency. However, the Act does not mandate a specific fairness metric, partly because the impossibility results make any single mandate technically problematic. The term "fairness" as used in the algorithmic fairness research community does not appear in the Act's text, reflecting a gap between technical and legal vocabularies.
In the United States, anti-discrimination law (including the Equal Credit Opportunity Act and Title VII of the Civil Rights Act) prohibits both disparate treatment (explicitly using protected attributes) and disparate impact (neutral policies that disproportionately affect protected groups). The impossibility theorems complicate compliance because meeting one legal standard may make it harder to meet another.
Imagine a teacher is grading a spelling test for two classes. One class practiced a lot and most students did well. The other class did not practice as much, so fewer students did well.
Now the teacher wants to be fair in three ways at the same time:

- A given mark should mean the same thing in both classes: students with the same mark should have done equally well
- Students who really knew their spelling words should be just as likely to get a good mark in either class
- Students who really did not know their words should be just as likely to get a poor mark in either class
When both classes have the same share of students who did well, the teacher can do all three things. But when the two classes are different (one did much better than the other), these three goals fight with each other. Making one thing perfectly fair can make another thing unfair. The teacher has to pick which kind of fair matters the most for this particular situation.
That is what the impossibility theorem says about computer programs that make predictions about people. When different groups of people have different rates of something (like getting a loan, committing a crime, or getting sick), the computer cannot be perfectly fair in every way at once. People have to decide which kind of fairness is most important.