# Incompatibility of Fairness Metrics

> Source: https://aiwiki.ai/wiki/incompatibility_of_fairness_metrics
> Updated: 2026-07-12
> Categories: AI Ethics, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **incompatibility of fairness metrics** (also called the **impossibility theorem of fairness** or **fairness trade-offs**) is the proven mathematical result that several widely used definitions of [algorithmic fairness](/wiki/algorithmic_fairness) cannot all be satisfied at the same time, except in two degenerate cases: a perfect classifier, or equal base rates of the outcome across groups. The result was established independently by Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan (2016) and by Alexandra Chouldechova (2017): when prevalence differs across groups, no imperfect risk-scoring method can simultaneously achieve calibration, equal false positive rates, and equal false negative rates. [1][2] In the words of Kleinberg et al., "except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously," demonstrating that "key notions of fairness are incompatible with each other." [2] The most cited real-world illustration is the 2016 dispute over the COMPAS recidivism tool, where ProPublica and its developer Northpointe were both correct under different, mutually exclusive definitions of fairness. [3][4] The practical consequence is that choosing a fairness criterion is unavoidably a value judgment, not a purely technical optimization, a finding that has shaped [machine learning](/wiki/machine_learning) research, policy design, and public debate about algorithms in high-stakes decisions.

## Background

As [artificial intelligence](/wiki/artificial_intelligence) systems are increasingly used to make or assist decisions in areas such as criminal justice, lending, hiring, and healthcare, concerns about unfair outcomes have grown. Different stakeholders often hold different intuitions about what "fairness" means, and researchers have formalized these intuitions into quantitative metrics. By the mid-2010s, dozens of distinct fairness criteria had been proposed in the academic literature. Arvind Narayanan catalogued at least 21 such definitions in a 2018 tutorial at the ACM Conference on Fairness, Accountability, and Transparency (FAccT), illustrating the breadth and fragmentation of the field. [10]

The problem is not just that there are many definitions. The impossibility theorems demonstrate that certain combinations of these definitions are mathematically contradictory under realistic conditions, forcing practitioners to make explicit choices about which aspects of fairness to prioritize. This is why the result is sometimes described as turning fairness from an engineering target into a question of which [fairness constraint](/wiki/fairness_constraint) a system should be held to.

## What are the main fairness metrics?

Before examining the impossibility results, it is helpful to define the main fairness metrics involved. Consider a [binary classification](/wiki/binary_classification) setting with a protected attribute G (such as race or gender) that divides the population into groups (written a and b in the formulas below), a true label Y (the actual outcome), and a predicted label or score R produced by the classifier.

| Metric | Also known as | Definition | Formal condition |
|---|---|---|---|
| [Calibration](/wiki/calibration_layer) | Predictive parity (when applied to binary predictions) | Among individuals assigned a given risk score, the fraction who actually have the positive outcome should be equal across groups | $$P(Y = 1 \mid R = r, G = a) = P(Y = 1 \mid R = r, G = b)$$ for all scores $$r$$ |
| [Demographic parity](/wiki/demographic_parity) | Statistical parity, independence | The proportion of individuals receiving a positive prediction should be the same across groups, regardless of the true outcome | $$P(R = 1 \mid G = a) = P(R = 1 \mid G = b)$$ |
| [Equalized odds](/wiki/equalized_odds) | Separation, error rate balance | Both the [true positive rate](/wiki/recall) and [false positive rate](/wiki/false_positive_rate_fpr) should be equal across groups | $$P(R = 1 \mid Y = y, G = a) = P(R = 1 \mid Y = y, G = b)$$ for $$y \in \{0, 1\}$$ |
| Equal opportunity | | The true positive rate should be equal across groups (a relaxation of equalized odds) | $$P(R = 1 \mid Y = 1, G = a) = P(R = 1 \mid Y = 1, G = b)$$ |
| Balance for the positive class | | Among individuals who truly belong to the positive class, the average score assigned should be the same across groups | $$\mathbb{E}[R \mid Y = 1, G = a] = \mathbb{E}[R \mid Y = 1, G = b]$$ |
| Balance for the negative class | | Among individuals who truly belong to the negative class, the average score assigned should be the same across groups | $$\mathbb{E}[R \mid Y = 0, G = a] = \mathbb{E}[R \mid Y = 0, G = b]$$ |

Each of these metrics captures a distinct and intuitively reasonable aspect of fairness. Calibration ensures that scores mean the same thing for everyone; equalized odds ensures that error rates are the same in every group; demographic parity ensures that positive predictions are handed out at the same rate in every group. The impossibility theorems reveal that these goals are in direct mathematical tension.
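As a concrete illustration, the group-level quantities behind these definitions can be read off a per-group confusion matrix. The sketch below is a minimal example, not a reference implementation: the `Confusion` class, the `group_metrics` helper, and the counts for the two groups are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one group (hypothetical container)."""
    tp: int  # predicted positive, truly positive
    fp: int  # predicted positive, truly negative
    fn: int  # predicted negative, truly positive
    tn: int  # predicted negative, truly negative


def group_metrics(c: Confusion) -> dict:
    """Return the per-group rates that the fairness criteria compare."""
    n = c.tp + c.fp + c.fn + c.tn
    return {
        "base_rate": (c.tp + c.fn) / n,       # prevalence of Y = 1
        "selection_rate": (c.tp + c.fp) / n,  # compared by demographic parity
        "tpr": c.tp / (c.tp + c.fn),          # compared by equal opportunity
        "fpr": c.fp / (c.fp + c.tn),          # together with tpr: equalized odds
        "ppv": c.tp / (c.tp + c.fp),          # compared by predictive parity
    }


# Made-up counts: group a has base rate 0.5, group b has base rate 0.2.
metrics_a = group_metrics(Confusion(tp=40, fp=20, fn=10, tn=30))
metrics_b = group_metrics(Confusion(tp=16, fp=16, fn=4, tn=64))

for name in metrics_a:
    gap = abs(metrics_a[name] - metrics_b[name])
    print(f"{name:15s} a={metrics_a[name]:.3f} b={metrics_b[name]:.3f} gap={gap:.3f}")
```

With these counts the true positive rate is 0.8 in both groups, so equal opportunity holds, while the false positive rate, the PPV, and the selection rate all differ.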

## What does the Chouldechova impossibility theorem (2017) say?

Alexandra Chouldechova's 2017 paper, "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," published in the journal *Big Data* (Volume 5, Issue 2, pp. 153-163), provided a concise algebraic proof of the incompatibility between predictive parity and error rate balance. Its abstract states the core result plainly: "We demonstrate that the criteria cannot all be simultaneously satisfied when recidivism prevalence differs across groups," and "We then show how disparate impact can arise when a recidivism prediction instrument fails to satisfy the criterion of error rate balance." [1]

### Statement of the theorem

Chouldechova showed that a binary classifier cannot simultaneously satisfy the following three conditions across two groups with different base rates (prevalence of the positive outcome):

1. **Predictive parity**: The [positive predictive value](/wiki/precision) (PPV) is equal across groups.
2. **False positive rate parity**: The [false positive rate](/wiki/false_positive_rate_fpr) (FPR) is equal across groups.
3. **False negative rate parity**: The [false negative rate](/wiki/false_negative_rate) (FNR) is equal across groups.

The only exceptions are two special cases: (a) the classifier is perfect (makes no errors), or (b) the base rates of the positive outcome are equal across groups.

### Proof sketch

The proof relies on a straightforward algebraic identity connecting the FPR, the PPV, the FNR, and the base rate $$p$$ (prevalence) of each group: [1]

$$
\text{FPR} = \left[\frac{p}{1 - p}\right] \times \left[\frac{1 - \text{PPV}}{\text{PPV}}\right] \times (1 - \text{FNR})
$$

This equation shows that, once the base rate $$p$$ is fixed, the FPR is pinned down by the PPV and the FNR. If two groups share the same PPV (predictive parity holds) but have different base rates ($$p_a \ne p_b$$), then, outside the perfect-classifier case, the FPR and FNR cannot both be equal across the two groups: forcing predictive parity onto unequal prevalence mathematically forces the error rates apart. Therefore, predictive parity and error rate balance cannot hold at the same time for an imperfect classifier when base rates differ.
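The identity is easy to check numerically. The following sketch is illustrative only: the function name `fpr_from_identity`, the confusion counts, and the shared PPV and FNR values are made up for the example. It first confirms the identity against a directly computed FPR, then holds PPV and FNR equal for two groups and varies only the base rate.

```python
def fpr_from_identity(p, ppv, fnr):
    """FPR implied by base rate p, PPV and FNR via the identity above."""
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)


# Step 1: sanity-check the identity on hypothetical confusion counts.
tp, fp, fn, tn = 40, 20, 10, 30
n = tp + fp + fn + tn
p = (tp + fn) / n          # base rate
ppv = tp / (tp + fp)       # positive predictive value
fnr = fn / (tp + fn)       # false negative rate
direct_fpr = fp / (fp + tn)
identity_fpr = fpr_from_identity(p, ppv, fnr)
print(round(direct_fpr, 3), round(identity_fpr, 3))

# Step 2: same PPV and FNR in both groups, different base rates.
shared_ppv, shared_fnr = 0.6, 0.3
fpr_a = fpr_from_identity(0.5, shared_ppv, shared_fnr)
fpr_b = fpr_from_identity(0.4, shared_ppv, shared_fnr)
print(round(fpr_a, 3), round(fpr_b, 3))
```

In this toy setting the group with base rate 0.5 is forced to an FPR of about 0.467 and the group with base rate 0.4 to about 0.311, even though PPV and FNR are identical by assumption.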

### Significance

This result is significant because it applies to any classifier, regardless of the algorithm used. It is not a limitation of a particular model or a flaw in training data; it is a structural mathematical constraint. Any system that claims to satisfy predictive parity and both forms of error rate balance at once must either be perfectly accurate or operate in a setting where base rates happen to be equal.

## What does the Kleinberg-Mullainathan-Raghavan impossibility theorem (2016) say?

Independently of Chouldechova, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a closely related but distinct impossibility result in their 2016 paper, "Inherent Trade-Offs in the Fair Determination of Risk Scores" (arXiv:1609.05807), published at the 8th Innovations in Theoretical Computer Science (ITCS) Conference in 2017. [2]

### The three fairness conditions

Kleinberg et al. formalized three conditions for a risk score assigned to individuals:

1. **Calibration within groups**: For each group, among individuals assigned a particular score value $$v$$, the expected fraction who truly belong to the positive class should be $$v$$. In other words, the score should be an accurate estimate of the probability of the positive outcome, separately within each group.
2. **Balance for the positive class**: Among individuals who truly belong to the positive class, the average score should be the same across groups.
3. **Balance for the negative class**: Among individuals who truly belong to the negative class, the average score should be the same across groups.
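To make the three conditions concrete, the sketch below builds a score that is calibrated within each group by construction and then measures the two balance conditions. It is a toy example under stated assumptions: each group is described by a hypothetical mapping from score value to head count, and the helper `class_balance` works with expected counts (a fraction $$v$$ of the people scored $$v$$ are treated as positive) rather than sampled outcomes.

```python
def class_balance(score_counts):
    """Summarise a score that is calibrated by construction.

    score_counts maps a score value v to the number of people given that
    score. Calibration is assumed: a fraction v of them are positives.
    """
    total = sum(score_counts.values())
    pos_mass = sum(v * n for v, n in score_counts.items())        # expected positives
    neg_mass = sum((1 - v) * n for v, n in score_counts.items())  # expected negatives
    return {
        "base_rate": pos_mass / total,
        # Average score among true positives (balance for the positive class).
        "avg_score_pos": sum(v * v * n for v, n in score_counts.items()) / pos_mass,
        # Average score among true negatives (balance for the negative class).
        "avg_score_neg": sum(v * (1 - v) * n for v, n in score_counts.items()) / neg_mass,
    }


# Hypothetical groups: the same two score values in a different mix,
# which gives the groups different base rates.
group_a = class_balance({0.8: 50, 0.2: 50})
group_b = class_balance({0.8: 25, 0.2: 75})

for key in group_a:
    print(f"{key:14s} a={group_a[key]:.3f} b={group_b[key]:.3f}")
```

Both groups are calibrated, yet the average score among positives is 0.68 in group a and about 0.54 in group b, and the average among negatives is 0.32 versus about 0.25, so neither balance condition holds.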

### Statement of the theorem

The authors proved that except in highly constrained special cases, no risk scoring method can satisfy all three conditions simultaneously. As the abstract puts it, "we prove that except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously." [2] The two special cases are:

1. **Perfect prediction**: The risk score perfectly separates positive and negative cases (there is no overlap in score distributions between the two classes).
2. **Equal base rates**: The prevalence of the positive outcome is the same in both groups.

Furthermore, Kleinberg et al. showed that "even satisfying all three conditions approximately requires that the data lie in an approximate version of one of the constrained special cases identified by our theorem." [2] This strengthens the result by ruling out the hope that one might come "close enough" to satisfying all three conditions through careful algorithm design.

### Relationship to the Chouldechova result

While the two theorems were proved independently and use different formalisms, they address the same fundamental tension. Chouldechova's result focuses on binary predictions and uses the language of classification error rates and predictive values. Kleinberg et al. work with continuous risk scores and frame their conditions in terms of calibration and score balance. Both arrive at the same conclusion: calibration-type fairness and error-balance-type fairness are incompatible when base rates differ.

| Feature | Chouldechova (2017) | Kleinberg, Mullainathan, Raghavan (2016) |
|---|---|---|
| Publication | *Big Data*, Vol. 5, No. 2 | ITCS 2017 (arXiv:1609.05807) |
| Prediction type | Binary classification | Continuous risk scores |
| Incompatible conditions | PPV parity + FPR parity + FNR parity | Calibration + balance for positive class + balance for negative class |
| Special cases allowing compatibility | Perfect classifier or equal base rates | Perfect prediction or equal base rates |
| Proof technique | Algebraic identity linking FPR, PPV, FNR, and base rate | Structural argument from properties of conditional expectations |

## Additional impossibility results

Several other researchers have extended or complemented these foundational theorems.

### Pleiss et al. (2017): On fairness and calibration

Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger presented "On Fairness and Calibration" at NeurIPS 2017. They showed that calibration is compatible with at most one additional error constraint (for example, equal false negative rates across groups), and that any algorithm achieving this relaxed combination is equivalent to randomizing a fraction of predictions from an existing classifier. [6] They also introduced the concept of "equal cost," where false positive rates and false negative rates are allowed to compensate each other according to a cost function, and showed that the incompatibility persists under this relaxation.
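The randomization idea can be sketched in a few lines. This is a simplified illustration in the spirit of that result, not the authors' published algorithm: it assumes scores that are calibrated by construction, uses the mean of one minus the score among true positives as a generalized false negative rate, and all names (`base_rate`, `gen_fnr`, `mix_with_base_rate`, the toy groups) are hypothetical.

```python
def base_rate(score_counts):
    """Expected share of positives when a fraction v of each score bin is positive."""
    total = sum(score_counts.values())
    return sum(v * n for v, n in score_counts.items()) / total


def gen_fnr(score_counts):
    """Generalized false negative rate: mean of (1 - score) among true positives."""
    pos_mass = sum(v * n for v, n in score_counts.items())
    return sum(v * (1 - v) * n for v, n in score_counts.items()) / pos_mass


def mix_with_base_rate(score_counts, alpha):
    """Replace a random fraction alpha of scores with the group's base rate.

    Because the replaced people are chosen at random, every score bin
    (including the new base-rate bin) stays calibrated.
    """
    mu = base_rate(score_counts)
    total = sum(score_counts.values())
    mixed = {v: n * (1 - alpha) for v, n in score_counts.items()}
    mixed[mu] = mixed.get(mu, 0.0) + total * alpha
    return mixed


group_a = {0.8: 50, 0.2: 50}  # base rate 0.50
group_b = {0.8: 25, 0.2: 75}  # base rate 0.35

fnr_a, fnr_b = gen_fnr(group_a), gen_fnr(group_b)
# Scoring everyone at the base rate gives a generalized FNR of 1 - base rate.
trivial_fnr_a = 1 - base_rate(group_a)
# The mixed rate is linear in alpha, so solve for the fraction that
# raises group a's generalized FNR to group b's.
alpha = (fnr_b - fnr_a) / (trivial_fnr_a - fnr_a)
mixed_a = mix_with_base_rate(group_a, alpha)
print(round(alpha, 3), round(gen_fnr(mixed_a), 3), round(fnr_b, 3))
```

In this toy case about three quarters of group a's scores have to be replaced by the uninformative base rate before the two generalized false negative rates match, which shows how costly the relaxed combination can be.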

### Friedler, Scheidegger, and Venkatasubramanian (2016): On the (im)possibility of fairness

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian approached the problem from a philosophical angle. Their paper, later expanded and published in *Communications of the ACM* in 2021, argued that different fairness definitions correspond to different assumptions about how the world works. [8] They introduced the concept of a "construct space" (representing true, unobservable qualities of individuals) versus an "observed space" (representing measured features). They showed that two internally consistent but mutually incompatible worldviews underlie different fairness criteria: one worldview assumes that observed differences between groups are largely artifacts of biased measurement processes, while the other assumes that observed data generally reflects genuine differences. This framing explains why disagreements about algorithmic fairness often run deeper than technical disputes.

### Corbett-Davies and Goel (2018): The measure and mismeasure of fairness

Sam Corbett-Davies and Sharad Goel provided an extensive survey in "The Measure and Mismeasure of Fairness," later published in the *Journal of Machine Learning Research* in 2023. [9] They categorized fairness definitions into two families: those that constrain the effects of decisions on disparities and those that constrain the influence of protected characteristics on decisions. Their analysis showed that requiring certain fairness definitions to hold can, perversely, harm the very groups they were intended to protect by leading to strongly Pareto-dominated decision policies.

### Dwork et al. (2012): Fairness through awareness

An earlier and influential work by Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel proposed **individual fairness** as an alternative to group-based metrics. [7] Individual fairness requires that similar individuals be treated similarly, where similarity is defined by a task-specific distance metric. While this approach sidesteps some group-level incompatibilities, it introduces the challenge of defining the similarity metric itself, which can be subjective and context-dependent.
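One common way to read "similar individuals treated similarly" is as a Lipschitz-style condition: the gap between two people's scores should not exceed the distance between them. The sketch below checks that condition on toy data. Everything in it is an assumption made for illustration, including the helper name `lipschitz_violations`, the single "qualification" feature, and the choice of absolute difference as the distance.

```python
from itertools import combinations


def lipschitz_violations(people, score, distance):
    """Return pairs whose score gap exceeds their distance.

    Individual fairness is read here as the condition
    |score(a) - score(b)| <= distance(a, b) for every pair.
    """
    return [
        (a, b)
        for a, b in combinations(people, 2)
        if abs(score(a) - score(b)) > distance(a, b)
    ]


# Hypothetical applicants with a single normalised "qualification" feature.
qualification = {"ana": 0.70, "ben": 0.72, "cy": 0.30}
model_score = {"ana": 0.80, "ben": 0.55, "cy": 0.45}

violations = lipschitz_violations(
    list(qualification),
    score=lambda name: model_score[name],
    distance=lambda a, b: abs(qualification[a] - qualification[b]),
)
print(violations)
```

Only the pair of near-identical applicants with very different scores is flagged. The check is only as meaningful as the distance function passed in, which is exactly the difficulty noted above.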

## The COMPAS case study

The most widely discussed real-world example of the fairness metric incompatibility is the controversy surrounding COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a [recidivism](/wiki/recidivism) risk assessment tool developed by Northpointe (now Equivant). The dispute between ProPublica and Northpointe became a defining moment in the public understanding of [AI bias](/wiki/ai_bias) and algorithmic fairness.

### Background

COMPAS was designed by statistician Tim Brennan and corrections professional Dave Wells. The system uses 137 survey questions combined with criminal history data to generate risk scores on a scale of 1 to 10, predicting the likelihood that a defendant will commit a new crime within two years of release. [3] Although race is not used as an explicit input, many of the input variables (such as neighborhood characteristics, employment history, and prior arrests) correlate with racial demographics.

### What did ProPublica find in 2016?

In May 2016, journalists Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published "Machine Bias" in ProPublica, analyzing COMPAS predictions for more than 7,000 criminal defendants arrested in Broward County, Florida, in 2013 and 2014. [3] They compared the algorithm's predictions against actual recidivism outcomes over a two-year follow-up period.

Their key findings on error rates by race:

| Metric | Black defendants | White defendants |
|---|---|---|
| False positive rate (labeled high-risk but did not reoffend) | 44.9% | 23.5% |
| False negative rate (labeled low-risk but did reoffend) | 28.0% | 47.7% |

ProPublica concluded that the algorithm was biased against Black defendants because it was nearly twice as likely to incorrectly label Black defendants as high-risk (false positives) and much more likely to incorrectly label White defendants as low-risk (false negatives). As the article put it, "The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants." [3] Even after controlling for prior crimes, age, and gender, ProPublica reported that Black defendants were 45% more likely to be assigned a higher risk score for committing a future crime of any kind. [3]

### How did Northpointe respond?

Northpointe, represented by William Dieterich, Christina Mendoza, and Tim Brennan, rejected the conclusion that COMPAS was racially biased. [4] Their defense centered on calibration: they argued that among defendants who received the same COMPAS score, Black and White defendants had nearly identical actual recidivism rates. For example, among defendants who scored a seven, approximately 60% of White defendants and 61% of Black defendants went on to reoffend. By this standard, the scores "meant the same thing" regardless of race. [4]

Northpointe reported AUC (Area Under the Curve) scores of roughly 0.63 to 0.70, with comparable performance across racial groups, and argued that the error rate disparities ProPublica identified were a natural mathematical consequence of different base rates of recidivism between groups, not evidence of bias. [4]

### Resolution through the impossibility theorems

The COMPAS debate was eventually recognized as a concrete illustration of the impossibility theorems. The recidivism base rates differed significantly between groups: approximately 51% for Black defendants and 39% for White defendants in the Broward County data. [14] Given this difference in base rates, the theorems by Chouldechova and by Kleinberg et al. prove that no algorithm, regardless of how well designed, can simultaneously achieve:

- Equal calibration across racial groups (Northpointe's standard)
- Equal false positive rates across racial groups (one of ProPublica's concerns)
- Equal false negative rates across racial groups (the other side of error rate balance)

Both ProPublica and Northpointe were measuring real properties of the system, but they were applying different fairness standards that are mathematically impossible to satisfy at the same time. The dispute was not a matter of one side being wrong; it reflected a genuine, unavoidable trade-off.
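The figures quoted in this article can be used to check that claim numerically. The sketch below is only a back-of-the-envelope calculation: it takes the rounded base rates and the error rates reported by ProPublica as given, treats "high-risk" as a binary prediction, and computes the positive predictive value those numbers imply. The helper name `implied_ppv` is hypothetical, and because the inputs are rounded the outputs are approximate.

```python
def implied_ppv(base_rate, fpr, fnr):
    """PPV implied by a group's base rate, false positive and false negative rates."""
    true_pos = base_rate * (1 - fnr)   # share of the group that is a true positive
    false_pos = (1 - base_rate) * fpr  # share of the group that is a false positive
    return true_pos / (true_pos + false_pos)


# Rounded figures quoted above (Broward County data).
ppv_black = implied_ppv(base_rate=0.51, fpr=0.449, fnr=0.280)
ppv_white = implied_ppv(base_rate=0.39, fpr=0.235, fnr=0.477)

print(f"implied PPV, Black defendants: {ppv_black:.2f}")
print(f"implied PPV, White defendants: {ppv_white:.2f}")
print(f"PPV gap: {abs(ppv_black - ppv_white):.2f}")
print(f"FPR gap: {abs(0.449 - 0.235):.2f}")
```

The implied PPVs come out at roughly 0.63 and 0.59, a gap of about four percentage points, while the false positive rates differ by more than twenty points. A high-risk label therefore carried a broadly similar meaning in both groups even though the errors fell very differently.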

A separate 2018 study by Julia Dressel and Hany Farid added context on COMPAS's accuracy, finding that the tool was correct about 65.2% of the time and concluding that it "is no more accurate or fair than predictions made by people with little or no criminal justice expertise." [14] They also showed that a simple linear classifier using just two features, age and total number of prior convictions, matched COMPAS's accuracy, underscoring that the fairness tension stems from differing base rates rather than from the sophistication of the model.

## Why do base rates matter?

The base rate (or prevalence) of the positive outcome in each group is the key quantity driving the impossibility results. When base rates are equal across groups, all of the standard fairness metrics can, in principle, be satisfied simultaneously. But when base rates differ, the mathematical constraints bind.

Base rates often differ across demographic groups for complex social, historical, and structural reasons. In criminal justice, for example, differences in arrest rates across racial groups reflect not only differences in criminal behavior but also differences in policing practices, enforcement priorities, neighborhood surveillance, and historical discrimination. These upstream factors create base rate differences that then trigger the impossibility constraints downstream in any predictive system.

This observation highlights a broader point: algorithmic fairness problems are often symptoms of societal inequalities rather than purely technical defects. No amount of algorithmic tuning can resolve base rate differences caused by structural factors outside the algorithm's control.

## Practical implications

The impossibility theorems do not mean that fairness is unachievable or that practitioners should abandon the effort. Rather, they mean that fairness requires explicit choices about which criteria matter most in a given context.

### How do practitioners choose among fairness criteria?

Different application domains may call for different fairness priorities:

| Domain | Typically prioritized metric | Rationale |
|---|---|---|
| Criminal justice (pretrial risk) | Calibration | Scores should accurately reflect risk regardless of group membership; poorly calibrated scores could lead to unjust detention or release decisions |
| Lending and credit | Equalized odds or equal opportunity | Qualified applicants from all groups should have the same chance of receiving credit; unequal error rates may violate fair lending laws |
| Hiring | Demographic parity or equal opportunity | Ensuring proportional representation may be a legal or organizational goal |
| Healthcare | Calibration | Clinical risk scores must reflect true patient risk to avoid misallocation of medical resources |
| Child welfare screening | Balance of false positive and false negative rates | Both wrongful family separations and missed cases of genuine risk carry severe consequences |

There is no universal answer. The choice of which metric to prioritize depends on the specific harms at stake, the legal and regulatory environment, and the values of the affected communities.

### Bias mitigation approaches

Practitioners have developed three broad categories of techniques for managing fairness constraints, even when perfect satisfaction of all metrics is impossible.

**Pre-processing methods** modify the training data before a model is built. Examples include resampling to balance group representation, removing or transforming features correlated with the protected attribute, and relabeling instances to reduce historical bias in the training labels.

**In-processing methods** modify the learning algorithm itself. These approaches incorporate fairness constraints directly into the [loss function](/wiki/loss_function) or [optimization](/wiki/optimizer) procedure, so that the model is trained to balance accuracy against one or more fairness objectives. Regularization terms penalizing fairness violations are a common technique.

**Post-processing methods** adjust the outputs of a trained model. For example, group-specific classification thresholds can be set so that the false positive rate or false negative rate is equalized across groups. Hardt, Price, and Srebro (2016) showed in their paper "Equality of Opportunity in Supervised Learning" that post-processing can optimally adjust any learned predictor to satisfy equalized odds or equal opportunity constraints, though this typically comes at some cost to overall accuracy. [5]

Each approach involves trade-offs. Pre-processing may discard useful information. In-processing may slow training or reduce model expressiveness. Post-processing may produce individually inconsistent decisions (two individuals with the same features may receive different predictions if they belong to different groups).
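The post-processing idea of group-specific thresholds can be sketched as follows. This is the simple deterministic variant, not the randomized procedure analyzed by Hardt, Price, and Srebro, and it equalizes only the false positive rate. The helper names and the integer scores and labels for the two groups are hypothetical.

```python
def fpr_at_threshold(scores, labels, threshold):
    """Share of true negatives whose score is at or above the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= threshold for s in negatives) / len(negatives)


def threshold_for_target_fpr(scores, labels, target_fpr):
    """Lowest threshold whose false positive rate does not exceed the target."""
    # FPR can only fall as the threshold rises, so scan candidates upwards.
    for t in sorted(set(scores)) + [float("inf")]:
        if fpr_at_threshold(scores, labels, t) <= target_fpr:
            return t


# Toy data: scores on a 1-10 scale, label 1 = positive outcome.
scores_a = [2, 3, 4, 5, 6, 7, 8, 9]
labels_a = [0, 0, 0, 1, 0, 1, 1, 1]
scores_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels_b = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# A single shared threshold gives unequal false positive rates.
shared = 5
print(fpr_at_threshold(scores_a, labels_a, shared),
      fpr_at_threshold(scores_b, labels_b, shared))

# Group-specific thresholds chosen to hit the same target FPR.
target = 0.25
t_a = threshold_for_target_fpr(scores_a, labels_a, target)
t_b = threshold_for_target_fpr(scores_b, labels_b, target)
print(t_a, fpr_at_threshold(scores_a, labels_a, t_a))
print(t_b, fpr_at_threshold(scores_b, labels_b, t_b))
```

Here the shared threshold of 5 yields false positive rates of 0.25 and 0.5, while thresholds of 5 and 7 bring both groups to 0.25. The price is the inconsistency noted above: a score of 6 is treated as positive in one group and negative in the other.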

### Multicalibration and newer frameworks

Recent research has explored ways to work within the constraints of the impossibility theorems rather than against them. Hebert-Johnson, Kim, Reingold, and Rothblum (2018) introduced **multicalibration**, which requires that a predictor be well-calibrated not just overall but simultaneously for every computationally identifiable subgroup. [12] This framework does not violate the impossibility theorems (it does not attempt to equalize error rates), but it offers strong guarantees that predictions are meaningful for a wide range of overlapping subgroups, including intersections of protected attributes.

Other researchers have explored approximate relaxations of the impossibility results. A 2023 paper presented at ACM FAccT, "The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice," showed that if practitioners accept small margins of error between metrics, large sets of models can simultaneously satisfy false negative rate parity, false positive rate parity, and positive predictive value parity, even when base rates differ moderately. [13] This suggests that the impossibility theorems, while mathematically precise, may be less binding in practice than their worst-case formulations suggest.
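One way to see why small tolerances help is to reuse the identity from the proof sketch and ask how large an FPR gap it forces when PPV and FNR are held exactly equal. The sketch below is a toy calculation, not the method of the paper cited above: the shared PPV and FNR values, the tolerance `EPSILON`, and the base-rate pairs are assumptions chosen for illustration.

```python
def fpr_from_identity(p, ppv, fnr):
    """FPR implied by base rate p, PPV and FNR (identity from the proof sketch)."""
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)


def fpr_gap(p_a, p_b, ppv=0.6, fnr=0.3):
    """FPR difference forced on two groups that share the same PPV and FNR."""
    return abs(fpr_from_identity(p_a, ppv, fnr) - fpr_from_identity(p_b, ppv, fnr))


EPSILON = 0.05  # hypothetical tolerance a practitioner might accept

gap_small = fpr_gap(0.40, 0.42)  # base rates that differ slightly
gap_large = fpr_gap(0.39, 0.51)  # base rates like those in the COMPAS data

for label, gap in [("0.40 vs 0.42", gap_small), ("0.39 vs 0.51", gap_large)]:
    print(f"base rates {label}: FPR gap {gap:.3f}, within tolerance: {gap <= EPSILON}")
```

With base rates two points apart the forced FPR gap is under three points, inside the assumed tolerance. With base rates twelve points apart it is close to nineteen points, so the trade-off binds again.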

## Policy and regulatory context

The impossibility of simultaneously satisfying multiple fairness criteria has implications for regulation and law. When a company or government agency claims that its algorithm is "fair," the claim is underdetermined without specifying which metric of fairness is being used.

The [EU AI Act](/wiki/eu_ai_act), which entered into force in August 2024, classifies AI systems used in criminal justice, credit scoring, hiring, and other sensitive areas as "high-risk" and imposes requirements for bias testing and transparency. However, the Act does not mandate a specific fairness metric, a choice consistent with the impossibility results, which make any single mandate technically problematic. The Act's text also does not use "fairness" in the technical sense adopted by the algorithmic fairness research community, reflecting a gap between technical and legal vocabularies.

In the United States, anti-discrimination law (including the Equal Credit Opportunity Act and Title VII of the Civil Rights Act) prohibits both disparate treatment (explicitly using protected attributes) and [disparate impact](/wiki/disparate_impact) (neutral policies that disproportionately affect protected groups). The impossibility theorems complicate compliance because meeting one legal standard may make it harder to meet another.

## Explain like I'm 5 (ELI5)

Imagine a teacher is grading a spelling test for two classes. One class practiced a lot and most students did well. The other class did not practice as much, so fewer students did well.

Now the teacher wants to be fair in three ways at the same time:

1. If a student gets a gold star, that student really did well on the test, no matter which class they are in.
2. Students in both classes who did well should have the same chance of getting a gold star.
3. Students in both classes who did poorly should have the same chance of not getting a gold star.

When both classes have the same share of students who did well, the teacher can do all three things. But when the two classes are different (one did much better than the other), it turns out these three goals fight with each other. Making one thing perfectly fair can make another thing unfair. The teacher has to pick which kind of fair matters the most for this particular situation.

That is what the impossibility theorem says about computer programs that make predictions about people. When different groups of people have different rates of something (like repaying a loan, committing a crime, or getting sick), the computer cannot be perfectly fair in every way at once. People have to decide which kind of fairness is most important.

## See also

- [Algorithmic fairness](/wiki/algorithmic_fairness)
- [Algorithmic bias](/wiki/bias_ethics_fairness)
- [AI bias](/wiki/ai_bias)
- [Calibration](/wiki/calibration_layer)
- [Confusion matrix](/wiki/confusion_matrix)
- [Demographic parity](/wiki/demographic_parity)
- [Disparate impact](/wiki/disparate_impact)
- [Equalized odds](/wiki/equalized_odds)
- [Fairness constraint](/wiki/fairness_constraint)
- [Fairness metric](/wiki/fairness_metric)
- [False positive rate](/wiki/false_positive_rate_fpr)
- [Precision](/wiki/precision)
- [Recall](/wiki/recall)

## References

1. Chouldechova, A. (2017). "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments." *Big Data*, 5(2), 153-163. https://doi.org/10.1089/big.2016.0047 (preprint arXiv:1703.00056).
2. Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent trade-offs in the fair determination of risk scores." *Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017)*. arXiv:1609.05807. https://arxiv.org/abs/1609.05807
3. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine Bias." *ProPublica*, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
4. Dieterich, W., Mendoza, C., & Brennan, T. (2016). "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." Northpointe Inc.
5. Hardt, M., Price, E., & Srebro, N. (2016). "Equality of opportunity in supervised learning." *Advances in Neural Information Processing Systems 29 (NeurIPS 2016)*, 3315-3323.
6. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). "On fairness and calibration." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*.
7. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). "Fairness through awareness." *Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS 2012)*, 214-226.
8. Friedler, S. A., Scheidegger, C., & Venkatasubramanian, S. (2021). "The (im)possibility of fairness: Different value systems require different mechanisms for fair decision making." *Communications of the ACM*, 64(4), 136-143.
9. Corbett-Davies, S., & Goel, S. (2018). "The measure and mismeasure of fairness: A critical review of fair machine learning." *arXiv preprint arXiv:1808.00023*. Later published in *Journal of Machine Learning Research*, 2023.
10. Narayanan, A. (2018). "21 fairness definitions and their politics." Tutorial at the ACM Conference on Fairness, Accountability, and Transparency (FAccT).
11. Barocas, S., Hardt, M., & Narayanan, A. (2023). *Fairness and Machine Learning: Limitations and Opportunities*. MIT Press.
12. Hebert-Johnson, U., Kim, M. P., Reingold, O., & Rothblum, G. N. (2018). "Multicalibration: Calibration for the (computationally-identifiable) masses." *Proceedings of the 35th International Conference on Machine Learning (ICML 2018)*.
13. Bell, A., Bynum, L., Drushchak, N., Herasymova, T., Rosenblatt, L., & Stoyanovich, J. (2023). "The possibility of fairness: Revisiting the impossibility theorem in practice." *Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2023)*.
14. Dressel, J., & Farid, H. (2018). "The accuracy, fairness, and limits of predicting recidivism." *Science Advances*, 4(1), eaao5580. https://doi.org/10.1126/sciadv.aao5580