The incompatibility of fairness metrics (also called the impossibility theorem of fairness or fairness trade-offs) refers to a set of mathematical results proving that several widely used definitions of algorithmic fairness cannot be satisfied simultaneously, except under highly restricted conditions. These theorems, established independently by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), show that when the base rates of an outcome differ across demographic groups, no classifier can achieve calibration, equal false positive rates, and equal false negative rates at the same time. The results have had a lasting effect on machine learning research, policy design, and public debate about the role of algorithms in high-stakes decision-making.
As artificial intelligence systems are increasingly used to make or assist decisions in areas such as criminal justice, lending, hiring, and healthcare, concerns about unfair outcomes have grown. Different stakeholders often hold different intuitions about what "fairness" means, and researchers have formalized these intuitions into quantitative metrics. By the mid-2010s, dozens of distinct fairness criteria had been proposed in the academic literature. Arvind Narayanan catalogued at least 21 such definitions in a 2018 tutorial at the ACM Conference on Fairness, Accountability, and Transparency (FAccT), illustrating the breadth and fragmentation of the field.
The problem is not just that there are many definitions. The impossibility theorems demonstrate that certain combinations of these definitions are mathematically contradictory under realistic conditions, forcing practitioners to make explicit choices about which aspects of fairness to prioritize.
Before examining the impossibility results, it is helpful to define the main fairness metrics involved. Consider a binary classification setting with a protected attribute (such as race or gender) that divides the population into groups, a true label Y (the actual outcome), and a predicted label or score R produced by the classifier.
| Metric | Also known as | Definition | Formal condition |
|---|---|---|---|
| Calibration | Predictive parity (when applied to binary predictions) | Among individuals assigned a given risk score, the fraction who actually have the positive outcome should be equal across groups | P(Y = 1 \| R = r, G = a) = P(Y = 1 \| R = r, G = b) for all scores r |
| Demographic parity | Statistical parity, independence | The proportion of individuals receiving a positive prediction should be the same across groups, regardless of the true outcome | P(R = 1 \| G = a) = P(R = 1 \| G = b) |
| Equalized odds | Separation, error rate balance | Both the true positive rate and false positive rate should be equal across groups | P(R = 1 \| Y = y, G = a) = P(R = 1 \| Y = y, G = b) for y in {0, 1} |
| Equal opportunity | — | The true positive rate should be equal across groups (a relaxation of equalized odds) | P(R = 1 \| Y = 1, G = a) = P(R = 1 \| Y = 1, G = b) |
| Balance for the positive class | — | Among individuals who truly belong to the positive class, the average score assigned should be the same across groups | E[R \| Y = 1, G = a] = E[R \| Y = 1, G = b] |
| Balance for the negative class | — | Among individuals who truly belong to the negative class, the average score assigned should be the same across groups | E[R \| Y = 0, G = a] = E[R \| Y = 0, G = b] |
Each of these metrics captures a distinct and intuitively reasonable aspect of fairness. Calibration ensures that scores mean the same thing for everyone; equalized odds ensures that errors are distributed evenly; demographic parity ensures that outcomes are distributed evenly. The impossibility theorems reveal that these goals are in direct mathematical tension.
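To make these definitions concrete, the following sketch computes the per-group quantities that the metrics above compare. The function name, argument names, and NumPy-array setup are illustrative choices for this example, not tied to any particular library's API.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group base rate, PPV, FPR, and FNR for a binary classifier.

    y_true, y_pred : arrays of 0/1 true outcomes and predictions
    group          : array of group labels (e.g., "a", "b")
    """
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tp = np.sum((yp == 1) & (yt == 1))
        fp = np.sum((yp == 1) & (yt == 0))
        fn = np.sum((yp == 0) & (yt == 1))
        tn = np.sum((yp == 0) & (yt == 0))
        rates[g] = {
            "base_rate": yt.mean(),                              # P(Y = 1 | G = g)
            "ppv": tp / (tp + fp) if tp + fp else float("nan"),  # P(Y = 1 | R = 1, G = g)
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),  # P(R = 1 | Y = 0, G = g)
            "fnr": fn / (fn + tp) if fn + tp else float("nan"),  # P(R = 0 | Y = 1, G = g)
        }
    return rates
```

Comparing these dictionaries across groups shows directly which parity conditions hold: equal `ppv` values correspond to predictive parity, while equal `fpr` and `fnr` values correspond to error rate balance.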
Alexandra Chouldechova's 2017 paper, "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," published in the journal Big Data (Volume 5, Issue 2, pp. 153-163), provided a concise algebraic proof of the incompatibility between predictive parity and error rate balance.
Chouldechova showed that a binary classifier cannot simultaneously satisfy the following three conditions across two groups:

- Predictive parity: equal positive predictive value (PPV) across groups
- Equal false positive rates (FPR) across groups
- Equal false negative rates (FNR) across groups

The only exceptions are two special cases: (a) the classifier is perfect (makes no errors), or (b) the base rates of the positive outcome (its prevalence) are equal across groups.
The proof relies on a straightforward algebraic identity connecting the PPV, FPR, FNR, and base rate p of each group:
PPV / (1 - PPV) = [p / (1 - p)] x [(1 - FNR) / FPR]
This equation shows that the PPV is determined by three quantities: the base rate, the false negative rate, and the false positive rate. If the base rates differ between two groups (p_a is not equal to p_b), then equalizing the PPV across groups requires that the ratio (1 - FNR) / FPR also differ between groups. But if both the FNR and FPR are individually equalized across groups, then the ratio is the same for both groups, creating a contradiction. Therefore, predictive parity and error rate balance cannot hold at the same time when base rates differ.
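A small numeric check makes the tension visible. The base rates and error rates below are hypothetical, chosen only for illustration: holding the FPR and FNR fixed across two groups with different prevalences forces their PPVs apart.

```python
def ppv_from_rates(p, fpr, fnr):
    """PPV implied by base rate p, false positive rate, and false negative rate,
    via PPV / (1 - PPV) = [p / (1 - p)] * [(1 - FNR) / FPR]."""
    odds = (p / (1 - p)) * ((1 - fnr) / fpr)
    return odds / (1 + odds)

# Two hypothetical groups with identical error rates but different base rates
fpr, fnr = 0.20, 0.30
print(f"{ppv_from_rates(0.50, fpr, fnr):.3f}")  # 0.778 at a 50% base rate
print(f"{ppv_from_rates(0.30, fpr, fnr):.3f}")  # 0.600 at a 30% base rate
```

Equalizing the error rates here leaves the PPVs unequal (0.778 versus 0.600); conversely, forcing the PPVs to match would require the error rates to diverge.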
This result is significant because it applies to any classifier, regardless of the algorithm used. It is not a limitation of a particular model or a flaw in training data; it is a structural mathematical constraint. Any system that claims to be "fair" by multiple metrics simultaneously must either be perfectly accurate or operate in a setting where base rates happen to be equal.
Independently of Chouldechova, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a closely related but distinct impossibility result in their 2016 paper, "Inherent Trade-Offs in the Fair Determination of Risk Scores," published at the 8th Innovations in Theoretical Computer Science (ITCS) Conference in 2017.
Kleinberg et al. formalized three conditions for a risk score assigned to individuals:

- Calibration within groups: among individuals in any group who receive a given score, the fraction with the positive outcome should match that score, so that a score carries the same meaning in every group
- Balance for the positive class: the average score assigned to individuals who truly have the positive outcome should be the same across groups
- Balance for the negative class: the average score assigned to individuals who truly have the negative outcome should be the same across groups
The authors proved that, except in highly constrained special cases, no risk scoring method can satisfy all three conditions simultaneously. The two special cases are:

- Perfect prediction: the score assigns each individual a value of exactly 0 or 1 that matches their true outcome
- Equal base rates: the proportion of individuals with the positive outcome is the same in every group
Furthermore, Kleinberg et al. showed that even approximately satisfying all three conditions requires the data to closely resemble one of these two special cases. This strengthens the result by ruling out the hope that one might come "close enough" to satisfying all three conditions through careful algorithm design.
While the two theorems were proved independently and use different formalisms, they address the same fundamental tension. Chouldechova's result focuses on binary predictions and uses the language of classification error rates and predictive values. Kleinberg et al. work with continuous risk scores and frame their conditions in terms of calibration and score balance. Both arrive at the same conclusion: calibration-type fairness and error-balance-type fairness are incompatible when base rates differ.
| Feature | Chouldechova (2017) | Kleinberg, Mullainathan, Raghavan (2016) |
|---|---|---|
| Publication | Big Data, Vol. 5, No. 2 | ITCS 2017 |
| Prediction type | Binary classification | Continuous risk scores |
| Incompatible conditions | PPV parity + FPR parity + FNR parity | Calibration + balance for positive class + balance for negative class |
| Special cases allowing compatibility | Perfect classifier or equal base rates | Perfect prediction or equal base rates |
| Proof technique | Algebraic identity linking PPV, FPR, FNR, and base rate | Structural argument from properties of conditional expectations |
Several other researchers have extended or complemented these foundational theorems.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger presented "On Fairness and Calibration" at NeurIPS 2017. They showed that calibration is compatible with at most one additional error constraint (for example, equal false negative rates across groups), and that any algorithm achieving this relaxed combination is equivalent to randomizing a fraction of predictions from an existing classifier. They also introduced the concept of "equal cost," where false positive rates and false negative rates are allowed to compensate each other according to a cost function, and showed that the incompatibility persists under this relaxation.
Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian approached the problem from a philosophical angle. Their paper, later expanded and published in Communications of the ACM in 2021, argued that different fairness definitions correspond to different assumptions about how the world works. They introduced the concept of a "construct space" (representing true, unobservable qualities of individuals) versus an "observed space" (representing measured features). They showed that two internally consistent but mutually incompatible worldviews underlie different fairness criteria: one worldview assumes that observed differences between groups are largely artifacts of biased measurement processes, while the other assumes that observed data generally reflects genuine differences. This framing explains why disagreements about algorithmic fairness often run deeper than technical disputes.
Sam Corbett-Davies and Sharad Goel provided an extensive survey in "The Measure and Mismeasure of Fairness," later published in the Journal of Machine Learning Research in 2023. They categorized fairness definitions into two families: those that constrain the effects of decisions on disparities and those that constrain the influence of protected characteristics on decisions. Their analysis showed that requiring certain fairness definitions to hold can, perversely, harm the very groups they were intended to protect by leading to strongly Pareto-dominated decision policies.
An earlier and influential work by Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel, "Fairness Through Awareness" (2012), proposed individual fairness as an alternative to group-based metrics. Individual fairness requires that similar individuals be treated similarly, where similarity is defined by a task-specific distance metric. While this approach sidesteps some group-level incompatibilities, it introduces the challenge of defining the similarity metric itself, which can be subjective and context-dependent.
The most widely discussed real-world example of the fairness metric incompatibility is the controversy surrounding COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a recidivism risk assessment tool developed by Northpointe (now Equivant). The dispute between ProPublica and Northpointe became a defining moment in the public understanding of algorithmic fairness.
COMPAS was designed by statistician Tim Brennan and corrections professional Dave Wells. The system uses 137 survey questions combined with criminal history data to generate risk scores on a scale of 1 to 10, predicting the likelihood that a defendant will commit a new crime within two years of release. Although race is not used as an explicit input, many of the input variables (such as neighborhood characteristics, employment history, and prior arrests) correlate with racial demographics.
In May 2016, journalists Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published "Machine Bias" in ProPublica, analyzing COMPAS predictions for approximately 10,000 criminal defendants in Broward County, Florida. They compared the algorithm's predictions against actual recidivism outcomes over a two-year follow-up period.
Their key findings on error rates by race:
| Metric | Black defendants | White defendants |
|---|---|---|
| False positive rate (labeled high-risk but did not reoffend) | 44.8% | 23.5% |
| False negative rate (labeled low-risk but did reoffend) | 28.0% | 47.7% |
| Overall accuracy | 63.8% | 67.0% |
ProPublica concluded that the algorithm was biased against Black defendants: Black defendants who did not reoffend were nearly twice as likely as White defendants to be incorrectly labeled high-risk, while White defendants who did reoffend were substantially more likely than Black defendants to be incorrectly labeled low-risk.
Northpointe, represented by William Dieterich, Christina Mendoza, and Tim Brennan, rejected the conclusion that COMPAS was racially biased. Their defense centered on calibration: they argued that among defendants who received the same COMPAS score, Black and White defendants had nearly identical actual recidivism rates. For example, among defendants who scored a seven, approximately 60% of White defendants and 61% of Black defendants went on to reoffend. By this standard, the scores "meant the same thing" regardless of race.
Northpointe reported AUC (Area Under the Curve) scores of approximately 0.63 to 0.65, with comparable performance across racial groups, and argued that the error rate disparities ProPublica identified were a natural mathematical consequence of different base rates of recidivism between groups, not evidence of bias.
The COMPAS debate was eventually recognized as a concrete illustration of the impossibility theorems. The recidivism base rates differed significantly between groups: approximately 51.4% for Black defendants and 39.4% for White defendants in the Broward County data. Given this difference in base rates, the theorems by Chouldechova and by Kleinberg et al. prove that no algorithm, regardless of how well designed, can simultaneously achieve:

- Calibration (equal predictive values across groups)
- Equal false positive rates across groups
- Equal false negative rates across groups
Both ProPublica and Northpointe were measuring real properties of the system, but they were applying different fairness standards that are mathematically impossible to satisfy at the same time. The dispute was not a matter of one side being wrong; it reflected a genuine, unavoidable trade-off.
The base rate (or prevalence) of the positive outcome in each group is the key quantity driving the impossibility results. When base rates are equal across groups, all of the standard fairness metrics can, in principle, be satisfied simultaneously. But when base rates differ, the mathematical constraints bind.
Base rates often differ across demographic groups for complex social, historical, and structural reasons. In criminal justice, for example, differences in arrest rates across racial groups reflect not only differences in criminal behavior but also differences in policing practices, enforcement priorities, neighborhood surveillance, and historical discrimination. These upstream factors create base rate differences that then trigger the impossibility constraints downstream in any predictive system.
This observation highlights a broader point: algorithmic fairness problems are often symptoms of societal inequalities rather than purely technical defects. No amount of algorithmic tuning can resolve base rate differences caused by structural factors outside the algorithm's control.
The impossibility theorems do not mean that fairness is unachievable or that practitioners should abandon the effort. Rather, they mean that fairness requires explicit choices about which criteria matter most in a given context.
Different application domains may call for different fairness priorities:
| Domain | Typically prioritized metric | Rationale |
|---|---|---|
| Criminal justice (pretrial risk) | Calibration | Scores should accurately reflect risk regardless of group membership; poorly calibrated scores could lead to unjust detention or release decisions |
| Lending and credit | Equalized odds or equal opportunity | Qualified applicants from all groups should have the same chance of receiving credit; unequal error rates may violate fair lending laws |
| Hiring | Demographic parity or equal opportunity | Ensuring proportional representation may be a legal or organizational goal |
| Healthcare | Calibration | Clinical risk scores must reflect true patient risk to avoid misallocation of medical resources |
| Child welfare screening | Balance of false positive and false negative rates | Both wrongful family separations and missed cases of genuine risk carry severe consequences |
There is no universal answer. The choice of which metric to prioritize depends on the specific harms at stake, the legal and regulatory environment, and the values of the affected communities.
Practitioners have developed three broad categories of techniques for managing fairness constraints, even when perfect satisfaction of all metrics is impossible.
Pre-processing methods modify the training data before a model is built. Examples include resampling to balance group representation, removing or transforming features correlated with the protected attribute, and relabeling instances to reduce historical bias in the training labels.
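One concrete pre-processing example is reweighing in the style of Kamiran and Calders, which assigns each training instance a weight so that group membership and label are statistically independent in the weighted data. The sketch below assumes a pandas DataFrame with `group` and `label` columns; the column and function names are illustrative, not a specific library's API.

```python
import pandas as pd

def reweigh(df, group_col="group", label_col="label"):
    """Weight each row by P(group) * P(label) / P(group, label) so that the
    weighted data shows no association between group membership and label."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n  # P(group, label)
    weights = df.apply(
        lambda row: p_group[row[group_col]] * p_label[row[label_col]]
        / p_joint[(row[group_col], row[label_col])],
        axis=1,
    )
    return df.assign(weight=weights)
```

The resulting `weight` column can then be passed to any learner that accepts sample weights.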
In-processing methods modify the learning algorithm itself. These approaches incorporate fairness constraints directly into the loss function or optimization procedure, so that the model is trained to balance accuracy against one or more fairness objectives. Regularization terms penalizing fairness violations are a common technique.
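As a minimal sketch of the in-processing idea (a generic illustration, not a specific published algorithm), the following trains a logistic regression by gradient descent while penalizing the squared gap in mean predicted probability between two groups, a soft demographic-parity regularizer. The assumption that the two groups are encoded as 0 and 1, and the `lam` weight, are choices of this example.

```python
import numpy as np

def train_fair_logreg(X, y, group, lam=1.0, lr=0.1, epochs=500):
    """Logistic regression whose loss adds lam * (demographic parity gap)**2."""
    w = np.zeros(X.shape[1])
    a, b = (group == 0), (group == 1)          # boolean masks for the two groups
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
        grad_loss = X.T @ (p - y) / len(y)     # gradient of the log loss
        gap = p[a].mean() - p[b].mean()        # demographic parity gap
        s = p * (1 - p)                        # derivative of the sigmoid
        grad_gap = (X[a] * s[a][:, None]).mean(axis=0) - (X[b] * s[b][:, None]).mean(axis=0)
        w -= lr * (grad_loss + lam * 2 * gap * grad_gap)
    return w
```

Raising `lam` trades accuracy for a smaller gap between groups; setting it to zero recovers ordinary logistic regression.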
Post-processing methods adjust the outputs of a trained model. For example, group-specific classification thresholds can be set so that the false positive rate or false negative rate is equalized across groups. Hardt, Price, and Srebro (2016) showed in their paper "Equality of Opportunity in Supervised Learning" that post-processing can optimally adjust any learned predictor to satisfy equalized odds or equal opportunity constraints, though this typically comes at some cost to overall accuracy.
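A minimal sketch of group-specific thresholding follows; it illustrates the general idea rather than the randomized-threshold construction of Hardt et al. For each group, it chooses the score cutoff whose false positive rate is closest to a shared target, which approximately equalizes FPR across groups.

```python
import numpy as np

def thresholds_for_fpr(scores, y_true, group, target_fpr=0.2):
    """Pick a per-group threshold t so that P(score >= t | Y = 0, G = g) is
    approximately target_fpr in every group."""
    chosen = {}
    for g in np.unique(group):
        negatives = scores[(group == g) & (y_true == 0)]
        # The (1 - target_fpr) quantile of negative scores leaves roughly
        # target_fpr of that group's negatives above the threshold.
        chosen[g] = np.quantile(negatives, 1 - target_fpr)
    return chosen
```

Applying `scores >= chosen[g]` within each group then yields roughly equal false positive rates, at the cost of the individually inconsistent decisions noted in the next paragraph.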
Each approach involves trade-offs. Pre-processing may discard useful information. In-processing may slow training or reduce model expressiveness. Post-processing may produce individually inconsistent decisions (two individuals with the same features may receive different predictions if they belong to different groups).
Recent research has explored ways to work within the constraints of the impossibility theorems rather than against them. Hebert-Johnson, Kim, Reingold, and Rothblum (2018) introduced multicalibration, which requires that a predictor be well-calibrated not just overall but simultaneously for every computationally identifiable subgroup. This framework does not violate the impossibility theorems (it does not attempt to equalize error rates), but it offers strong guarantees that predictions are meaningful for a wide range of overlapping subgroups, including intersections of protected attributes.
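As a rough illustration of what multicalibration asks for (a simplified audit, not the full algorithm from Hebert-Johnson et al.), one can check whether predictions are calibrated within every specified subgroup and score bucket:

```python
import numpy as np

def calibration_violations(scores, y_true, subgroups, n_bins=10, alpha=0.05):
    """List (subgroup, bin) cells where the mean outcome differs from the mean
    predicted score by more than alpha. `subgroups` maps names to boolean masks."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    violations = []
    for name, mask in subgroups.items():
        for b in range(n_bins):
            cell = mask & (bins == b)
            if not cell.any():
                continue
            gap = abs(y_true[cell].mean() - scores[cell].mean())
            if gap > alpha:
                violations.append((name, b, float(gap)))
    return violations
```

A multicalibrated predictor would return an empty list even for a rich, overlapping collection of subgroups, including intersections of protected attributes.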
Other researchers have explored approximate relaxations of the impossibility results. A 2023 paper presented at ACM FAccT, "The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice," showed that if practitioners accept small margins of error between metrics, large sets of models can simultaneously satisfy false negative rate parity, false positive rate parity, and positive predictive value parity, even when base rates differ moderately. This suggests that the impossibility theorems, while mathematically precise, may be less binding in practice than their worst-case formulations suggest.
The impossibility of simultaneously satisfying multiple fairness criteria has implications for regulation and law. When a company or government agency claims that its algorithm is "fair," the claim is underdetermined without specifying which metric of fairness is being used.
The EU AI Act, which took effect in 2024, classifies AI systems used in criminal justice, credit scoring, hiring, and other sensitive areas as "high-risk" and imposes requirements for bias testing and transparency. However, the Act does not mandate a specific fairness metric, partly because the impossibility results make any single mandate technically problematic. The term "fairness" as used in the algorithmic fairness research community does not appear in the Act's text, reflecting a gap between technical and legal vocabularies.
In the United States, anti-discrimination law (including the Equal Credit Opportunity Act and Title VII of the Civil Rights Act) prohibits both disparate treatment (explicitly using protected attributes) and disparate impact (neutral policies that disproportionately affect protected groups). The impossibility theorems complicate compliance because meeting one legal standard may make it harder to meet another.
Imagine a teacher is grading a spelling test for two classes. One class practiced a lot and most students did well. The other class did not practice as much, so fewer students did well.
Now the teacher wants to be fair in three ways at the same time:

- A given mark should mean the same thing in both classes: students with the same mark should have done equally well
- Students who really knew their spelling words should be just as likely to get a good mark in either class
- Students who really did not know their words should be just as likely to get a poor mark in either class
When both classes have the same share of students who did well, the teacher can do all three things. But when the two classes are different (one did much better than the other), these three goals fight with each other. Making one thing perfectly fair can make another thing unfair. The teacher has to pick which kind of fair matters the most for this particular situation.
That is what the impossibility theorem says about computer programs that make predictions about people. When different groups of people have different rates of something (like getting a loan, committing a crime, or getting sick), the computer cannot be perfectly fair in every way at once. People have to decide which kind of fairness is most important.