# Predictive Parity

> Source: https://aiwiki.ai/wiki/predictive_parity
> Updated: 2026-06-24
> Categories: AI Ethics, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Predictive parity** is a group [fairness metric](/wiki/fairness_metric) in [machine learning](/wiki/machine_learning) that holds when a classifier's [positive predictive value](/wiki/precision) (PPV), also called [precision](/wiki/precision), is equal across all demographic groups defined by a [sensitive attribute](/wiki/sensitive_attribute): formally, P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b) for every pair of groups a and b. In plain terms, a positive prediction carries the same probability of being correct no matter which group the individual belongs to. The concept is also called the **outcome test** in the academic literature, and it is closely related to the score-level notion of **test fairness** (calibration). [1][5]

Predictive parity is mathematically incompatible with error-rate balance whenever groups have different base rates: Alexandra Chouldechova proved in 2017 that no imperfect classifier can satisfy predictive parity together with equal [false positive rates](/wiki/false_positive_rate_fpr) and equal [false negative rates](/wiki/false_negative_rate) unless the prevalence of the positive outcome is identical across groups. [1] This result, along with a closely related theorem established independently by Kleinberg, Mullainathan, and Raghavan (2016), underlies the long-running dispute over the COMPAS recidivism tool and is among the central "impossibility" theorems of [algorithmic fairness](/wiki/algorithmic_fairness). [2]

Predictive parity belongs to the broader **sufficiency** family of fairness criteria, which require that the outcome variable Y is conditionally independent of the protected attribute A given the prediction. It stands in contrast to **separation**-based criteria such as [equalized odds](/wiki/equalized_odds) and **independence**-based criteria such as [demographic parity](/wiki/demographic_parity). The tensions between these families of fairness definitions have been the subject of several impossibility theorems that carry significant consequences for real-world algorithmic decision-making. [10]

## Formal definition

Let Y denote the true binary outcome (1 = positive, 0 = negative), let A denote a sensitive (protected) attribute with groups a and b (for example, race or gender), and let Ŷ denote the classifier's predicted outcome.

A classifier satisfies **predictive parity** if and only if:

```
P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b)
```

In words, the [positive predictive value](/wiki/precision) (PPV) must be the same for both groups. Since PPV = TP / (TP + FP), this is equivalent to requiring equal [precision](/wiki/precision) across groups. [5]

A classifier satisfying equal PPV will also have equal [false discovery rates](/wiki/false_positive_rate_fpr) (FDR = 1 - PPV) across groups:

```
P(Y = 0 | Ŷ = 1, A = a) = P(Y = 0 | Ŷ = 1, A = b)
```
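
The definition can be checked directly on labeled predictions. The following is a minimal sketch in plain Python; the function names, the tolerance `tol`, and the toy data are illustrative assumptions rather than part of any standard library.

```python
from collections import defaultdict

def ppv_by_group(y_true, y_pred, groups):
    """Return {group: PPV}, where PPV = TP / (TP + FP) among predicted positives."""
    true_pos = defaultdict(int)
    predicted_pos = defaultdict(int)
    for y, y_hat, g in zip(y_true, y_pred, groups):
        if y_hat == 1:
            predicted_pos[g] += 1
            true_pos[g] += y  # y is 1 for a true positive, 0 for a false positive
    # Groups with no positive predictions have an undefined PPV and are omitted.
    return {g: true_pos[g] / n for g, n in predicted_pos.items()}

def satisfies_predictive_parity(ppv, tol=0.05):
    """True if the largest PPV gap between any two groups is at most tol."""
    values = list(ppv.values())
    return max(values) - min(values) <= tol

# Hypothetical toy data: five individuals in each of two groups
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
groups = ["a"] * 5 + ["b"] * 5

ppv = ppv_by_group(y_true, y_pred, groups)
fdr = {g: 1 - v for g, v in ppv.items()}  # FDR = 1 - PPV
```

On this toy data group a has a PPV of 0.75 and group b a PPV of 0.5, so predictive parity fails. In practice exact equality is rare, which is why the check uses a tolerance; the appropriate value is a policy choice, not something the definition supplies.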

### Relationship to sufficiency

Predictive parity is a special case of the **sufficiency** criterion. Full sufficiency requires:

```
P(Y = y | Ŷ = ŷ, A = a) = P(Y = y | Ŷ = ŷ, A = b)  for all y, ŷ
```

This means the sensitive attribute provides no additional information about the true outcome once the prediction is known. Predictive parity only enforces this condition for the positive prediction (Ŷ = 1). A stronger version, **conditional use accuracy equality**, requires equal PPV and equal negative predictive value (NPV) simultaneously. [5]
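
The gap between predictive parity and conditional use accuracy equality can be seen in a small example. The counts below are hypothetical and chosen only so that PPV matches across groups while NPV does not.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts for one group."""
    ppv = tp / (tp + fp)  # P(Y = 1 | Y_hat = 1)
    npv = tn / (tn + fn)  # P(Y = 0 | Y_hat = 0)
    return ppv, npv

# Hypothetical counts for two groups
ppv_a, npv_a = predictive_values(tp=8, fp=2, tn=9, fn=1)
ppv_b, npv_b = predictive_values(tp=4, fp=1, tn=6, fn=4)

predictive_parity = ppv_a == ppv_b
conditional_use_accuracy_equality = predictive_parity and npv_a == npv_b
```

Here both groups have a PPV of 0.8, but NPV is 0.9 for group a and 0.6 for group b, so the classifier satisfies predictive parity without satisfying the stronger criterion.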

### Connection to calibration

Predictive parity is closely related to **calibration** (also called well-calibration or test fairness). A risk score S is calibrated across groups if:

```
P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b) = s  for all scores s
```

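Calibration is a stronger, score-level requirement than predictive parity: it requires the predicted probability to match the true probability at every score level, whereas predictive parity only requires that the overall fraction of correct positive predictions is equal across groups. A classifier satisfying predictive parity is therefore not necessarily calibrated. In the other direction, thresholding a calibrated score gives each group a PPV equal to the average score of its members above the threshold, so predictive parity follows when the score distributions above the threshold are similar across groups, and only approximately or not at all when they differ. [10]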
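
In practice, calibration within groups is usually assessed by binning scores and comparing the mean score in each bin with the observed positive rate. The sketch below assumes scores in [0, 1] and equal-width bins; the function name, bin count, and toy data are illustrative assumptions.

```python
from collections import defaultdict

def calibration_by_bin(scores, y_true, groups, n_bins=5):
    """Map (group, bin) -> (mean score, observed positive rate, count)."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # [sum of scores, positives, count]
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        cell = acc[(g, b)]
        cell[0] += s
        cell[1] += y
        cell[2] += 1
    return {k: (total / n, pos / n, n) for k, (total, pos, n) in acc.items()}

# Hypothetical toy data with two score levels per group
scores = [0.2] * 5 + [0.8] * 5 + [0.2] * 10 + [0.8] * 5
y_true = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0] + [1, 1] + [0] * 8 + [1, 1, 1, 1, 0]
groups = ["a"] * 10 + ["b"] * 15

table = calibration_by_bin(scores, y_true, groups)
```

A score is calibrated within groups when the first two entries of each tuple agree, up to sampling noise, in every bin of every group. In this toy data they agree exactly: a score of 0.2 corresponds to a 20% positive rate and a score of 0.8 to an 80% positive rate in both groups.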

## What entries of the confusion matrix does predictive parity use?

Predictive parity depends on only two entries of the [confusion matrix](/wiki/confusion_matrix), computed separately for each group: true positives and false positives. The following table summarizes the role of each entry:

| Component | Definition | Role in predictive parity |
|---|---|---|
| True positives (TP) | Correctly predicted positives | Numerator of PPV |
| [False positives](/wiki/false_positive_fp) (FP) | Incorrectly predicted positives | Denominator term of PPV |
| [True negatives](/wiki/confusion_matrix) (TN) | Correctly predicted negatives | Not directly involved |
| [False negatives](/wiki/false_negative_fn) (FN) | Incorrectly predicted negatives | Not directly involved |
| PPV (precision) | TP / (TP + FP) | Must be equal across groups |
| FDR | FP / (TP + FP) = 1 - PPV | Equal when PPV is equal |

## How does predictive parity differ from other fairness metrics?

Predictive parity is one of several group fairness metrics. Each metric enforces a different statistical constraint, and they reflect different intuitions about what it means for an algorithm to be fair. Barocas, Hardt, and Narayanan group the major observational criteria into three families: **independence** (predictions independent of the protected attribute), **separation** (predictions independent of the attribute given the true outcome), and **sufficiency** (the true outcome independent of the attribute given the prediction). Predictive parity and calibration belong to sufficiency; equalized odds and equal opportunity belong to separation; demographic parity belongs to independence. [10]

| Fairness metric | Formal condition | Fairness family | Intuition |
|---|---|---|---|
| Predictive parity | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | Sufficiency | Positive predictions are equally trustworthy across groups |
| [Equalized odds](/wiki/equalized_odds) | P(Ŷ=1 \| Y=y, A=a) = P(Ŷ=1 \| Y=y, A=b) for y in {0,1} | Separation | Equal [true positive rates](/wiki/recall) and equal [false positive rates](/wiki/false_positive_rate_fpr) across groups |
| Equal opportunity | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | Separation | Equal true positive rates across groups |
| [Demographic parity](/wiki/demographic_parity) | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | Independence | Equal selection rates across groups |
| Calibration | P(Y=1 \| S=s, A=a) = P(Y=1 \| S=s, A=b) = s | Sufficiency | Predicted probabilities match true probabilities for all groups |
| [Counterfactual fairness](/wiki/counterfactual_fairness) | P(Ŷ_{A←a} = y \| X = x, A = a) = P(Ŷ_{A←b} = y \| X = x, A = a) | Individual / Causal | Prediction would remain the same if the individual's group membership were changed |
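
The observational criteria in the table can be compared side by side from per-group confusion-matrix counts. The sketch below uses hypothetical counts (100 individuals per group, base rates of 0.5 and 0.2); the function name and dictionary keys are illustrative.

```python
def group_report(tp, fp, tn, fn):
    """Quantities that each group fairness criterion compares across groups."""
    n = tp + fp + tn + fn
    return {
        "selection_rate": (tp + fp) / n,  # demographic parity
        "tpr": tp / (tp + fn),            # equal opportunity
        "fpr": fp / (fp + tn),            # with TPR: equalized odds
        "ppv": tp / (tp + fp),            # predictive parity
    }

# Hypothetical counts
a = group_report(tp=40, fp=10, tn=40, fn=10)  # base rate 0.5
b = group_report(tp=16, fp=4, tn=76, fn=4)    # base rate 0.2

holds = {metric: a[metric] == b[metric] for metric in a}
```

With these counts predictive parity and equal opportunity hold (PPV and TPR are 0.8 in both groups), while demographic parity and equalized odds fail (selection rates of 0.5 versus 0.2 and false positive rates of 0.2 versus 0.05).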

## Why can predictive parity not be combined with other fairness criteria?

Several foundational results in algorithmic fairness prove that predictive parity cannot, in general, be satisfied simultaneously with other fairness criteria. These results are sometimes collectively called the "impossibility of fairness" theorems. The AI Wiki maintains a dedicated overview at [incompatibility of fairness metrics](/wiki/incompatibility_of_fairness_metrics).

### Chouldechova's theorem (2017)

Alexandra Chouldechova proved that when the base rates (prevalence of the positive outcome) differ between two groups, it is mathematically impossible for any classifier to simultaneously satisfy both predictive parity and classification parity (equal [false positive rates](/wiki/false_positive_rate_fpr) and equal [false negative rates](/wiki/false_negative_rate)). As she states the consequence directly, "if an instrument satisfies predictive parity ... but the prevalence differs between groups, the instrument cannot achieve equal false positive and false negative rates across those groups." [1]

The proof rests on the following identity derived from Bayes' theorem:

```
PPV / (1 - PPV) = (p / (1 - p)) * ((1 - FNR) / FPR)
```

where p is the base rate, FNR is the [false negative rate](/wiki/false_negative_rate), and FPR is the [false positive rate](/wiki/false_positive_rate_fpr).

If PPV is held equal across two groups (predictive parity) and the base rates p differ, the left-hand side is the same for both groups while p / (1 - p) is not, so the ratio (1 - FNR) / FPR must also differ. FPR and FNR therefore cannot both be equal across the groups. Conversely, if FPR and FNR are equalized across groups (classification parity), then PPV must differ whenever base rates differ. [1]
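
The identity can be checked numerically. The sketch below computes the PPV implied by a base rate and a pair of error rates, then applies identical error rates to two groups with different base rates; all numbers are hypothetical.

```python
def ppv_from_rates(p, fpr, fnr):
    """PPV implied by base rate p, false positive rate, and false negative rate."""
    true_pos = p * (1 - fnr)    # P(Y = 1, Y_hat = 1)
    false_pos = (1 - p) * fpr   # P(Y = 0, Y_hat = 1)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers: identical error rates, different base rates
fpr, fnr = 0.2, 0.2
ppv_a = ppv_from_rates(p=0.5, fpr=fpr, fnr=fnr)
ppv_b = ppv_from_rates(p=0.2, fpr=fpr, fnr=fnr)

# Odds form of the identity for group a: PPV / (1 - PPV) = p / (1 - p) * (1 - FNR) / FPR
lhs = ppv_a / (1 - ppv_a)
rhs = (0.5 / (1 - 0.5)) * ((1 - fnr) / fpr)
```

With equal error rates, the group with base rate 0.5 has a PPV of 0.8 and the group with base rate 0.2 has a PPV of 0.5, so classification parity forces a violation of predictive parity, as the theorem states.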

This result has a direct practical implication: in any domain where the prevalence of the outcome varies across demographic groups (which is common in criminal justice, healthcare, and lending), a designer must choose between making positive predictions equally reliable (predictive parity) and distributing errors equally (classification parity). Both cannot be achieved at once.

### Kleinberg, Mullainathan, and Raghavan's theorem (2016)

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a related but distinct impossibility result. They defined three fairness conditions for risk scores: [2]

1. **Calibration within groups**: Among individuals assigned a particular risk score, the fraction who are actually positive should be the same across groups.
2. **Balance for the positive class**: The average risk score assigned to individuals who are actually positive should be the same across groups.
3. **Balance for the negative class**: The average risk score assigned to individuals who are actually negative should be the same across groups.

Their theorem shows that except in highly constrained special cases, no risk score can satisfy all three conditions simultaneously. The two special cases are: (a) perfect prediction, where the classifier makes no errors, and (b) equal base rates across groups. In all other realistic scenarios, at least one condition must be violated. The authors conclude that these results "imply that the three fairness conditions are in general incompatible with each other ... except in special cases." [2]

Since calibration within groups is the score-level counterpart of predictive parity, this theorem further constrains the feasibility of achieving predictive parity alongside error-rate-based fairness.
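
The three conditions can be computed for a discrete risk score as follows. This is an illustrative sketch on hypothetical data, not the authors' construction: the score is calibrated within both groups, the base rates differ (0.5 versus 0.4), and neither balance condition holds.

```python
from statistics import mean

def kmr_summary(scores, y_true):
    """Per-group summary of the three conditions for a discrete risk score."""
    by_score = {}
    for s, y in zip(scores, y_true):
        by_score.setdefault(s, []).append(y)
    return {
        # Condition 1: observed positive rate at each score value
        "rate_at_score": {s: mean(ys) for s, ys in by_score.items()},
        # Condition 2: average score among actual positives
        "mean_score_pos": mean(s for s, y in zip(scores, y_true) if y == 1),
        # Condition 3: average score among actual negatives
        "mean_score_neg": mean(s for s, y in zip(scores, y_true) if y == 0),
    }

# Hypothetical data: group a has base rate 0.5, group b has base rate 0.4
a = kmr_summary([0.2] * 5 + [0.8] * 5, [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0])
b = kmr_summary([0.2] * 10 + [0.8] * 5, [1, 1] + [0] * 8 + [1, 1, 1, 1, 0])
```

Both groups show a 20% positive rate at score 0.2 and an 80% positive rate at score 0.8, yet the average score among actual positives is 0.68 in group a and 0.60 in group b, and the average among actual negatives is 0.32 versus roughly 0.27.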

### Pleiss et al. (2017)

Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Weinberger investigated the tension between calibration and error rate balance more closely. They showed that calibration is compatible with at most one error constraint, such as equal false negative rates across groups, and that "any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier." [7]

## What was the COMPAS debate about?

The most prominent real-world illustration of the predictive parity tradeoff is the controversy over [COMPAS](/wiki/compas) (Correctional Offender Management Profiling for Alternative Sanctions), a risk assessment tool used in the U.S. criminal justice system to predict [recidivism](/wiki/compas).

### Background

In May 2016, the investigative journalism organization ProPublica published an article titled "Machine Bias" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Their analysis examined COMPAS risk scores for more than 10,000 criminal defendants in Broward County, Florida (18,610 people scored in 2013 and 2014, of whom 11,757 were assessed at the pretrial stage), and found that Black defendants were roughly twice as likely as white defendants to be falsely labeled as high risk for recidivism. In the two-year follow-up sample of 6,172 defendants, the [false positive rate](/wiki/false_positive_rate_fpr) for Black defendants who did not reoffend was 44.85%, versus 23.45% for white defendants, while white defendants who did reoffend were mislabeled as low risk at a [false negative rate](/wiki/false_negative_rate) of 47.72%, versus 27.99% for Black defendants. [3]

### Northpointe's response

Northpointe (the company that developed COMPAS, renamed Equivant in 2017) responded in July 2016 with a report authored by William Dieterich, Christina Mendoza, and Tim Brennan titled "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." They argued that the algorithm was fair because it satisfied predictive parity: among defendants classified as high risk, roughly the same proportion of Black and white defendants actually went on to reoffend. In other words, a COMPAS score of 7 meant approximately the same probability of recidivism regardless of race. [4]

### The core disagreement

The two sides were measuring fairness by different metrics:

| Stakeholder | Fairness metric used | What it requires | Finding |
|---|---|---|---|
| ProPublica | Error rate balance (classification parity) | Equal FPR and FNR across racial groups | COMPAS fails: Black defendants have higher FPR |
| Northpointe | Predictive parity (calibration) | Equal PPV across racial groups | COMPAS passes: PPV is roughly equal across groups |

Because the base rate of recidivism differed between Black and white defendants in the Broward County data, Chouldechova's impossibility theorem explains why COMPAS could not satisfy both criteria simultaneously. The disagreement was not merely a matter of analytic error by either party; it reflected a genuine mathematical constraint. [1]

### Broader impact

The COMPAS debate had a significant impact on the field of algorithmic fairness. It prompted the development of formal impossibility theorems, spurred new research on post-processing methods for bias mitigation, and raised public awareness about the normative choices embedded in algorithmic systems. Dressel and Farid (2018) further showed that COMPAS, which uses up to 137 features, predicted recidivism with 65.4% accuracy, no better than non-expert humans recruited from Amazon Mechanical Turk, and that a simple logistic-regression classifier using only two features (age and number of prior convictions) achieved comparable accuracy of 66.6%. [9]

## Where is predictive parity used?

Predictive parity is relevant in any domain where an algorithm makes binary or risk-based predictions that affect people differently across demographic groups. [11]

### Criminal justice

Risk assessment instruments like COMPAS, PSA (Public Safety Assessment), and ORAS (Ohio Risk Assessment System) assign risk scores to defendants for pretrial release, sentencing, and parole decisions. Predictive parity requires that a given risk score carries the same meaning regardless of the defendant's race, gender, or other protected characteristics. [11]

### Healthcare

Clinical prediction models are used to estimate disease risk, allocate medical resources, and prioritize patients. Predictive parity ensures that when a model predicts a patient is at high risk for a condition, that prediction is equally reliable for patients of different races, ethnicities, ages, or genders. Violations can lead to systematic over-treatment or under-treatment of certain populations.

### Credit and lending

Credit scoring models predict the likelihood that a borrower will repay a loan. Predictive parity in this context means that among all applicants approved for a loan (positive prediction), the fraction who actually repay should be the same across demographic groups. This ensures that the model's approvals carry equal financial meaning regardless of the applicant's background.

### Hiring and education

Automated resume screening and admissions algorithms predict candidate success. Predictive parity requires that a positive prediction (e.g., "this candidate will succeed") is equally accurate across demographic groups, so that no group is disproportionately subject to incorrect positive predictions.

## How is predictive parity achieved?

Several techniques can be used to move a classifier toward predictive parity, though each involves tradeoffs with other fairness criteria and overall accuracy.

### Pre-processing methods

Pre-processing approaches modify the training data before the model is built. Techniques include resampling to balance class distributions within each group, reweighting training examples, and transforming features to remove correlations with the protected attribute. These methods aim to produce data from which a standard classifier will naturally produce predictions with equal PPV across groups.

### In-processing methods

In-processing approaches modify the learning algorithm itself. Constraints can be added to the optimization objective that penalize differences in PPV across groups during training. Regularization terms can also be designed to encourage sufficiency-based fairness. These methods tend to be model-specific and require access to the training pipeline.
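
One way such a penalty could look is sketched below. It is an assumption made for illustration, not a method taken from the cited literature: the hard PPV is replaced by a score-weighted surrogate (`soft_ppv`) so that the gap between groups varies smoothly with the model's scores, and the squared gap is added to an ordinary log loss with a hypothetical weight `lam`.

```python
import math

def soft_ppv(scores, y_true):
    """Smooth stand-in for PPV: score-weighted share of actual positives."""
    return sum(s * y for s, y in zip(scores, y_true)) / sum(scores)

def log_loss(scores, y_true):
    """Mean binary cross-entropy; scores must lie strictly between 0 and 1."""
    return -sum(
        y * math.log(s) + (1 - y) * math.log(1 - s)
        for s, y in zip(scores, y_true)
    ) / len(scores)

def penalized_loss(scores_a, y_a, scores_b, y_b, lam=1.0):
    """Pooled log loss plus lam times the squared soft-PPV gap between groups."""
    gap = soft_ppv(scores_a, y_a) - soft_ppv(scores_b, y_b)
    base = log_loss(scores_a + scores_b, y_a + y_b)
    return base + lam * gap ** 2

# Hypothetical scores (as lists) and labels for two groups
scores_a, y_a = [0.9, 0.8, 0.1], [1, 0, 0]
scores_b, y_b = [0.9, 0.1], [1, 0]
loss = penalized_loss(scores_a, y_a, scores_b, y_b, lam=1.0)
```

In a real training pipeline the same expression would be written in an automatic-differentiation framework and minimized over model parameters; the weight `lam` controls how much accuracy is traded for a smaller PPV gap.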

### Post-processing methods

Post-processing approaches adjust the outputs of an already-trained model. The most common technique is **group-specific threshold adjustment**: rather than using a single classification threshold for all groups, different thresholds are chosen for each group so that the resulting PPV is equalized. This approach is model-agnostic and does not require retraining, making it practical for organizations using off-the-shelf models. Because the underlying impossibility results still apply, equalizing PPV through post-processing generally comes at the cost of unequal error rates when base rates differ. [6][7]
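
Group-specific threshold adjustment can be sketched as a search over candidate thresholds in each group. The version below is a simplified illustration under stated assumptions: candidates are the observed score values, a shared `target_ppv` is chosen in advance, and each group takes the lowest threshold whose PPV is closest to that target. The function names and toy data are hypothetical.

```python
def ppv_at_threshold(scores, y_true, t):
    """PPV when predicting positive for score >= t; None if nothing is selected."""
    selected = [y for s, y in zip(scores, y_true) if s >= t]
    return sum(selected) / len(selected) if selected else None

def pick_threshold(scores, y_true, target_ppv):
    """Lowest observed score whose PPV, used as a threshold, is closest to target."""
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores)):  # every candidate selects at least one individual
        gap = abs(ppv_at_threshold(scores, y_true, t) - target_ppv)
        if gap < best_gap:  # strict '<' keeps the lowest threshold on ties
            best_t, best_gap = t, gap
    return best_t

# Hypothetical scores and outcomes for two groups
scores_a, y_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3], [1, 1, 1, 0, 1, 0]
scores_b, y_b = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2], [1, 1, 0, 1, 0, 0]

t_a = pick_threshold(scores_a, y_a, target_ppv=0.75)
t_b = pick_threshold(scores_b, y_b, target_ppv=0.75)
```

On this toy data the search selects a threshold of 0.6 for group a and 0.5 for group b, giving both groups a PPV of 0.75. Thresholds should be chosen on held-out data, and using different cutoffs for different groups may raise legal questions about disparate treatment in some jurisdictions.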

| Method type | When applied | Requires retraining | Model-agnostic | Typical tradeoff |
|---|---|---|---|---|
| Pre-processing | Before training | Yes | Yes | May reduce overall accuracy |
| In-processing | During training | Yes | No | Added complexity in optimization |
| Post-processing | After training | No | Yes | May sacrifice other fairness metrics |

## What are the limitations of predictive parity?

### Unequal base rates

The most fundamental challenge for predictive parity is that base rates (the prevalence of the positive outcome) frequently differ across demographic groups. As Chouldechova's theorem demonstrates, when base rates differ, achieving predictive parity forces unequal error rates. This means that one group will experience a higher false positive rate or a higher false negative rate than another, even when PPV is held constant. [1]

### Data quality and representation

Predictive parity relies on accurate [labels](/wiki/label) in the training and evaluation data. If the data reflects historical biases (for example, if arrest records reflect biased policing patterns rather than true criminal behavior), then the "ground truth" labels themselves are corrupted. Achieving predictive parity with respect to biased labels does not guarantee fairness with respect to the true underlying outcomes. [8]

### Infra-marginality

The outcome test (predictive parity) can be misleading when applied to aggregate statistics rather than marginal decisions. Two groups can have equal PPV overall while the classifier still discriminates at the margin. This problem, known as infra-marginality, means that predictive parity at the aggregate level does not necessarily imply fairness in individual cases. [8]

### Tension with individual fairness

Predictive parity is a group-level fairness criterion. It does not guarantee that similar individuals from different groups receive similar predictions. A classifier can satisfy predictive parity while treating specific individuals unfairly, as long as the group-level PPV statistics are balanced. [Individual fairness](/wiki/individual_fairness), which requires that similar individuals receive similar predictions, addresses a complementary concern.

### Ignoring negative predictions

Predictive parity focuses only on positive predictions (Ŷ = 1). It does not constrain what happens among negative predictions. A classifier could have equal PPV but very different negative predictive values (NPV) across groups, meaning that negative predictions are more reliable for one group than another.

## Explain like I'm 5 (ELI5)

Imagine a teacher who gives gold stars to students she thinks did well on a test. Predictive parity means that when she gives a gold star, she is right about the same percentage of the time for boys and for girls. If she gives gold stars to 10 boys and 8 of them actually did well (80% correct), then when she gives gold stars to 10 girls, about 8 of them should have actually done well too (also 80% correct).

If the gold stars are right 80% of the time for boys but only 60% of the time for girls, that is a problem: the gold star means something different depending on whether you are a boy or a girl. Predictive parity says the gold star should be worth the same for everyone.

The tricky part is that making the gold stars equally accurate for boys and girls might mean the teacher makes more mistakes in other ways. For example, she might miss more girls who actually did well (give them no star when they deserve one). That is the tradeoff that mathematicians have proven is unavoidable when different shares of the boys and the girls actually did well on the test.

## See also

- [Equalized odds](/wiki/equalized_odds)
- [Demographic parity](/wiki/demographic_parity)
- [Fairness metric](/wiki/fairness_metric)
- [Confusion matrix](/wiki/confusion_matrix)
- [Precision](/wiki/precision)
- [Recall](/wiki/recall)
- [Counterfactual fairness](/wiki/counterfactual_fairness)
- [Disparate impact](/wiki/disparate_impact)
- [Disparate treatment](/wiki/disparate_treatment)
- [Sensitive attribute](/wiki/sensitive_attribute)
- [Bias (ethics/fairness)](/wiki/bias_ethics_fairness)
- [Individual fairness](/wiki/individual_fairness)
- [Incompatibility of fairness metrics](/wiki/incompatibility_of_fairness_metrics)

## References

1. Chouldechova, A. (2017). "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments." *Big Data*, 5(2), 153-163.
2. Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent trade-offs in the fair determination of risk scores." *Proceedings of Innovations in Theoretical Computer Science (ITCS)*. arXiv:1609.05807.
3. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine Bias." *ProPublica*, May 23, 2016.
4. Dieterich, W., Mendoza, C., & Brennan, T. (2016). "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." Northpointe Inc.
5. Verma, S. & Rubin, J. (2018). "Fairness definitions explained." *Proceedings of the IEEE/ACM International Workshop on Software Fairness (FairWare)*, pp. 1-7.
6. Hardt, M., Price, E., & Srebro, N. (2016). "Equality of opportunity in supervised learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 29, 3315-3323.
7. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). "On fairness and calibration." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
8. Corbett-Davies, S. & Goel, S. (2018). "The measure and mismeasure of fairness: A critical review of fair machine learning." arXiv:1808.00023.
9. Dressel, J. & Farid, H. (2018). "The accuracy, fairness, and limits of predicting recidivism." *Science Advances*, 4(1), eaao5580.
10. Barocas, S., Hardt, M., & Narayanan, A. (2019). *Fairness and Machine Learning: Limitations and Opportunities*. fairmlbook.org (MIT Press, 2023).
11. Berk, R., Heidari, H., Jabbari, S., Kearns, M., & Roth, A. (2021; first published online 2018). "Fairness in criminal justice risk assessments: The state of the art." *Sociological Methods & Research*, 50(1), 3-44.
12. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). "Algorithmic decision making and the cost of fairness." *Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 797-806.