Predictive parity is a fairness metric in machine learning that requires an algorithm's positive predictive value (PPV), also known as precision, to be equal across all demographic groups defined by a sensitive attribute. When a classifier satisfies predictive parity, the probability that a positive prediction is correct does not depend on the group to which the individual belongs. The concept is also referred to as the outcome test in the academic literature.
Predictive parity belongs to the broader sufficiency family of fairness criteria, which require that the outcome variable Y is conditionally independent of the protected attribute A given the prediction. It stands in contrast to separation-based criteria such as equalized odds and independence-based criteria such as demographic parity. The tensions between these families of fairness definitions have been the subject of several impossibility theorems that carry significant consequences for real-world algorithmic decision-making.
Let Y denote the true binary outcome (1 = positive, 0 = negative), let A denote a sensitive (protected) attribute with groups a and b (for example, race or gender), and let Ŷ denote the classifier's predicted outcome.
A classifier satisfies predictive parity if and only if:
P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b)
In words, the positive predictive value (PPV) must be the same for both groups. Since PPV = TP / (TP + FP), this is equivalent to requiring equal precision across groups.
A classifier satisfying equal PPV will also have equal false discovery rates (FDR = 1 - PPV) across groups:
P(Y = 0 | Ŷ = 1, A = a) = P(Y = 0 | Ŷ = 1, A = b)
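The definition can be checked directly from labeled predictions. The following is a minimal sketch in Python with NumPy; the function name and the toy arrays are illustrative, not part of any standard library.

```python
import numpy as np

def group_ppv(y_true, y_pred, group_mask):
    """Positive predictive value P(Y=1 | Yhat=1) within one group."""
    mask = group_mask & (y_pred == 1)     # positive predictions in this group
    if mask.sum() == 0:
        return float("nan")               # no positive predictions for this group
    return y_true[mask].mean()            # fraction of those predictions that are correct

# Toy data: y_true = actual outcomes, y_pred = classifier output, a = sensitive attribute
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

for g in ("a", "b"):
    ppv = group_ppv(y_true, y_pred, a == g)
    print(f"group {g}: PPV = {ppv:.2f}, FDR = {1 - ppv:.2f}")
```

In this toy example both groups have a PPV of about 0.67, so predictive parity holds exactly; in practice the group-wise values would be compared up to a tolerance or tested statistically.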
Predictive parity is a special case of the sufficiency criterion. Full sufficiency requires:
P(Y = y | Ŷ = ŷ, A = a) = P(Y = y | Ŷ = ŷ, A = b) for all y, ŷ
This means the sensitive attribute provides no additional information about the true outcome once the prediction is known. Predictive parity only enforces this condition for the positive prediction (Ŷ = 1). A stronger version, conditional use accuracy equality, requires equal PPV and equal negative predictive value (NPV) simultaneously.
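Extending the same kind of check to negative predictions gives conditional use accuracy equality. The sketch below, using the same illustrative toy data and a hypothetical helper function, computes both PPV and NPV per group.

```python
import numpy as np

def ppv_and_npv(y_true, y_pred):
    """Return (PPV, NPV) for one group's labels and predictions."""
    pos = y_pred == 1
    neg = y_pred == 0
    ppv = y_true[pos].mean() if pos.any() else float("nan")        # P(Y=1 | Yhat=1)
    npv = (1 - y_true[neg]).mean() if neg.any() else float("nan")  # P(Y=0 | Yhat=0)
    return ppv, npv

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

for g in ("a", "b"):
    ppv, npv = ppv_and_npv(y_true[a == g], y_pred[a == g])
    print(f"group {g}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these toy arrays the two groups share the same PPV but differ in NPV, which satisfies predictive parity while violating the stronger conditional use accuracy equality.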
Predictive parity is closely related to calibration (also called well-calibration or test fairness). A risk score S is calibrated across groups if:
P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b) = s for all scores s
Calibration is a stronger requirement than predictive parity. A calibrated classifier automatically satisfies predictive parity, but a classifier satisfying predictive parity is not necessarily calibrated. Calibration requires the predicted probability to match the true probability at every score level, whereas predictive parity only requires that the overall fraction of correct positive predictions is equal across groups.
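Calibration across groups is typically assessed by binning the scores and comparing, within each bin and group, the mean predicted score to the observed rate of positive outcomes. The following is an illustrative sketch using synthetic scores that are calibrated by construction; the bin edges and function name are arbitrary choices.

```python
import numpy as np

def calibration_by_group(scores, y_true, group, bins=np.linspace(0.0, 1.0, 6)):
    """For each group and score bin, compare mean predicted score to observed positive rate."""
    rows = []
    bin_ids = np.digitize(scores, bins[1:-1])      # assign each score to a bin
    for g in np.unique(group):
        for b in range(len(bins) - 1):
            m = (group == g) & (bin_ids == b)
            if m.sum() == 0:
                continue
            rows.append((g, b, scores[m].mean(), y_true[m].mean(), int(m.sum())))
    return rows  # (group, bin, mean score, observed rate, count)

rng = np.random.default_rng(0)
n = 5000
group = rng.choice(["a", "b"], size=n)
scores = rng.uniform(0, 1, size=n)
y_true = (rng.uniform(0, 1, size=n) < scores).astype(int)  # outcomes drawn so scores are calibrated

for g, b, s, rate, count in calibration_by_group(scores, y_true, group):
    print(f"group {g}, bin {b}: mean score {s:.2f} vs observed rate {rate:.2f} (n={count})")
```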
Predictive parity is computed from specific entries in the confusion matrix. The following table shows how predictive parity relates to the confusion matrix:
| Component | Definition | Role in predictive parity |
|---|---|---|
| True positives (TP) | Correctly predicted positives | Numerator of PPV (also part of the denominator) |
| False positives (FP) | Incorrectly predicted positives | Part of the denominator of PPV |
| True negatives (TN) | Correctly predicted negatives | Not directly involved |
| False negatives (FN) | Incorrectly predicted negatives | Not directly involved |
| PPV (precision) | TP / (TP + FP) | Must be equal across groups |
| FDR | FP / (TP + FP) = 1 - PPV | Equal when PPV is equal |
Predictive parity is one of several group fairness metrics. Each metric enforces a different statistical constraint, and they reflect different intuitions about what it means for an algorithm to be fair.
| Fairness metric | Formal condition | Fairness family | Intuition |
|---|---|---|---|
| Predictive parity | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | Sufficiency | Positive predictions are equally trustworthy across groups |
| Equalized odds | P(Ŷ=1 \| Y=y, A=a) = P(Ŷ=1 \| Y=y, A=b) for y in {0,1} | Separation | Equal true positive rates and equal false positive rates across groups |
| Equal opportunity | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | Separation | Equal true positive rates across groups |
| Demographic parity | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | Independence | Equal selection rates across groups |
| Calibration | P(Y=1 \| S=s, A=a) = P(Y=1 \| S=s, A=b) = s | Sufficiency | Predicted probabilities match true probabilities for all groups |
| Counterfactual fairness | P(Ŷ_A=a \| X) = P(Ŷ_A=b \| X) | Individual / Causal | Prediction would remain the same if the individual's group membership were changed |
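Apart from calibration and counterfactual fairness, the group-level criteria in the table can all be read off from per-group selection rates, true positive rates, false positive rates, and PPV. A compact, illustrative sketch (the function name and toy data are not from any standard library):

```python
import numpy as np

def group_fairness_report(y_true, y_pred, group):
    """Per-group selection rate, TPR, FPR, and PPV for comparing the criteria in the table."""
    report = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        report[g] = {
            "selection_rate": p.mean(),                                   # demographic parity
            "TPR": p[t == 1].mean() if (t == 1).any() else float("nan"),  # equal opportunity
            "FPR": p[t == 0].mean() if (t == 0).any() else float("nan"),  # equalized odds (with TPR)
            "PPV": t[p == 1].mean() if (p == 1).any() else float("nan"),  # predictive parity
        }
    return report

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
for g, metrics in group_fairness_report(y_true, y_pred, a).items():
    print(g, {k: round(float(v), 2) for k, v in metrics.items()})
```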
Several foundational results in algorithmic fairness prove that predictive parity cannot, in general, be satisfied simultaneously with other fairness criteria. These results are sometimes collectively called the "impossibility of fairness" theorems.
Alexandra Chouldechova proved that when the base rates (prevalence of the positive outcome) differ between two groups, it is mathematically impossible for any imperfect classifier to simultaneously satisfy both predictive parity and classification parity (equal false positive rates and equal false negative rates); only a perfect predictor escapes the conflict.
The proof rests on the following identity derived from Bayes' theorem:
PPV / (1 - PPV) = (p / (1 - p)) * ((1 - FNR) / FPR)
where p is the base rate, FNR is the false negative rate, and FPR is the false positive rate.
If PPV is held equal across two groups (predictive parity) and the base rates p differ, then the ratio (1 - FNR) / FPR must also differ. This means the error rates cannot be equal. Conversely, if FPR and FNR are equalized across groups (classification parity), then PPV must differ whenever base rates differ.
This result has a direct practical implication: in any domain where the prevalence of the outcome varies across demographic groups (which is common in criminal justice, healthcare, and lending), a designer must choose between calibrating the predictions equally (predictive parity) or distributing errors equally (classification parity). Both cannot be achieved at once.
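The identity can be illustrated numerically. The sketch below fixes a common PPV for two groups with different base rates and computes the ratio (1 - FNR) / FPR that the identity then forces on each group; because the implied ratios differ, the two groups cannot share both error rates. The specific PPV and base rates are arbitrary choices.

```python
# Numerical illustration of PPV/(1-PPV) = (p/(1-p)) * ((1-FNR)/FPR).
def implied_error_ratio(ppv, base_rate):
    """Return the (1 - FNR) / FPR ratio implied by a given PPV and base rate."""
    odds_ppv = ppv / (1 - ppv)
    odds_p = base_rate / (1 - base_rate)
    return odds_ppv / odds_p

ppv = 0.6                      # same PPV enforced in both groups (predictive parity)
for group, p in (("a", 0.5), ("b", 0.3)):
    print(f"group {group}: base rate {p:.1f} -> (1-FNR)/FPR = {implied_error_ratio(ppv, p):.2f}")
```

With these numbers the ratio is 1.5 for group a and 3.5 for group b, so equal false positive and false negative rates are impossible while PPV stays at 0.6 in both groups.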
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a related but distinct impossibility result. They defined three fairness conditions for risk scores: calibration within groups (within each group, individuals assigned a score of s are positive with probability s), balance for the positive class (the average score of individuals who truly belong to the positive class is the same across groups), and balance for the negative class (the average score of individuals who truly belong to the negative class is the same across groups).
Their theorem shows that except in highly constrained special cases, no risk score can satisfy all three conditions simultaneously. The two special cases are: (a) perfect prediction, where the classifier makes no errors, and (b) equal base rates across groups. In all other realistic scenarios, at least one condition must be violated.
Since calibration within groups implies predictive parity, this theorem further constrains the feasibility of achieving predictive parity alongside error-rate-based fairness.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Weinberger investigated the tension between calibration and error rate balance more closely. They showed that calibration is compatible with at most one error constraint (specifically, equal false negative rates across groups) and that any algorithm satisfying this relaxation is no better than randomizing a fraction of predictions from an existing classifier.
The most prominent real-world illustration of the predictive parity tradeoff is the controversy over COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a risk assessment tool used in the U.S. criminal justice system to predict recidivism.
In May 2016, the investigative journalism organization ProPublica published an article titled "Machine Bias" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Their analysis examined COMPAS risk scores for over 10,000 criminal defendants in Broward County, Florida, and found that Black defendants were roughly twice as likely as white defendants to be falsely labeled as high risk for recidivism (a higher false positive rate), while white defendants who went on to reoffend were more likely to have been labeled as low risk (a higher false negative rate).
Northpointe (the company that developed COMPAS, now called Equivant) responded in July 2016 with a report authored by William Dieterich, Christina Mendoza, and Tim Brennan titled "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." They argued that the algorithm was fair because it satisfied predictive parity: among defendants classified as high risk, roughly the same proportion of Black and white defendants actually went on to reoffend. In other words, a COMPAS score of 7 meant approximately the same probability of recidivism regardless of race.
The two sides were measuring fairness by different metrics:
| Stakeholder | Fairness metric used | What it requires | Finding |
|---|---|---|---|
| ProPublica | Error rate balance (classification parity) | Equal FPR and FNR across racial groups | COMPAS fails: Black defendants have higher FPR |
| Northpointe | Predictive parity (calibration) | Equal PPV across racial groups | COMPAS passes: PPV is roughly equal across groups |
Because the base rate of recidivism differed between Black and white defendants in the Broward County data, Chouldechova's impossibility theorem explains why COMPAS could not satisfy both criteria simultaneously. The disagreement was not merely a matter of analytic error by either party; it reflected a genuine mathematical constraint.
The COMPAS debate had a significant impact on the field of algorithmic fairness. It prompted the development of formal impossibility theorems, spurred new research on post-processing methods for bias mitigation, and raised public awareness about the normative choices embedded in algorithmic systems. Dressel and Farid (2018) further showed that COMPAS was no more accurate than predictions made by untrained people recruited from Amazon Mechanical Turk, and that equivalent accuracy could be achieved with a simple linear classifier using only two features.
Predictive parity is relevant in any domain where an algorithm makes binary or risk-based predictions that affect people differently across demographic groups.
Risk assessment instruments like COMPAS, PSA (Public Safety Assessment), and ORAS (Ohio Risk Assessment System) assign risk scores to defendants for pretrial release, sentencing, and parole decisions. Predictive parity requires that a given risk score carries the same meaning regardless of the defendant's race, gender, or other protected characteristics.
Clinical prediction models are used to estimate disease risk, allocate medical resources, and prioritize patients. Predictive parity ensures that when a model predicts a patient is at high risk for a condition, that prediction is equally reliable for patients of different races, ethnicities, ages, or genders. Violations can lead to systematic over-treatment or under-treatment of certain populations.
Credit scoring models predict the likelihood that a borrower will repay a loan. Predictive parity in this context means that among all applicants approved for a loan (positive prediction), the fraction who actually repay should be the same across demographic groups. This ensures that the model's approvals carry equal financial meaning regardless of the applicant's background.
Automated resume screening and admissions algorithms predict candidate success. Predictive parity requires that a positive prediction (e.g., "this candidate will succeed") is equally accurate across demographic groups, so that no group is disproportionately subject to incorrect positive predictions.
Several techniques can be used to move a classifier toward predictive parity, though each involves tradeoffs with other fairness criteria and overall accuracy.
Pre-processing approaches modify the training data before the model is built. Techniques include resampling to balance class distributions within each group, reweighting training examples, and transforming features to remove correlations with the protected attribute. These methods aim to produce data from which a standard classifier will naturally produce predictions with equal PPV across groups.
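As an illustration of the reweighting idea, the sketch below computes instance weights in the spirit of the Kamiran and Calders reweighing scheme, which upweights (group, label) combinations that are under-represented relative to statistical independence. It is a generic pre-processing step rather than a method tailored specifically to PPV, and the function name is illustrative.

```python
import numpy as np

def reweighing_weights(y, a):
    """Instance weights that make Y statistically independent of A in the weighted data."""
    w = np.empty(len(y))
    for g in np.unique(a):
        for label in np.unique(y):
            mask = (a == g) & (y == label)
            expected = (a == g).mean() * (y == label).mean()   # P(A=g) * P(Y=label)
            observed = mask.mean()                             # P(A=g, Y=label)
            w[mask] = expected / observed if observed > 0 else 0.0
    return w

y = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
a = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
print(np.round(reweighing_weights(y, a), 2))  # under-represented combinations get weights above 1
```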
In-processing approaches modify the learning algorithm itself. Constraints can be added to the optimization objective that penalize differences in PPV across groups during training. Regularization terms can also be designed to encourage sufficiency-based fairness. These methods tend to be model-specific and require access to the training pipeline.
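A minimal in-processing sketch might add a differentiable surrogate for the PPV gap to a logistic regression objective, as below. The soft PPV expression, the penalty weight, and the synthetic data are illustrative assumptions, not a reference to any particular published method or library.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_ppv_penalty(w, X, y, a, lam=5.0):
    """Logistic loss plus a squared penalty on a soft PPV gap between groups."""
    p = sigmoid(X @ w)
    log_loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # Soft PPV per group: sum(p * y) / sum(p), a differentiable surrogate for precision.
    soft_ppv = [np.sum(p[a == g] * y[a == g]) / (np.sum(p[a == g]) + 1e-9)
                for g in ("a", "b")]
    return log_loss + lam * (soft_ppv[0] - soft_ppv[1]) ** 2

rng = np.random.default_rng(0)
n = 1000
a = rng.choice(["a", "b"], size=n)
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])  # two features + bias
logits = X @ np.array([1.5, -1.0, 0.2]) + (a == "b") * 0.8                 # group-dependent outcome
y = (rng.uniform(size=n) < sigmoid(logits)).astype(float)

result = minimize(loss_with_ppv_penalty, x0=np.zeros(3), args=(X, y, a))
print("fitted weights:", np.round(result.x, 2))
```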
Post-processing approaches adjust the outputs of an already-trained model. The most common technique is group-specific threshold adjustment: rather than using a single classification threshold for all groups, different thresholds are chosen for each group so that the resulting PPV is equalized. This approach is model-agnostic and does not require retraining, making it practical for organizations using off-the-shelf models.
A 2022 study by Dwork et al. proposed a model-agnostic post-processing transformation function specifically designed to achieve predictive rate parity with minimal impact on overall model performance.
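A simple version of group-specific threshold adjustment can be sketched as a grid search: for each group, choose the threshold whose resulting PPV is closest to a common target. The target PPV, the threshold grid, and the synthetic scores below are illustrative; a production system would also weigh the impact on accuracy and on other error rates.

```python
import numpy as np

def ppv_at_threshold(scores, y_true, threshold):
    """PPV of the binary predictions obtained by thresholding the scores."""
    pred = scores >= threshold
    return y_true[pred].mean() if pred.any() else float("nan")

def threshold_for_target_ppv(scores, y_true, target, grid=np.linspace(0.05, 0.95, 181)):
    """Pick the group-specific threshold whose PPV is closest to the target."""
    ppvs = np.array([ppv_at_threshold(scores, y_true, t) for t in grid])
    return grid[np.nanargmin(np.abs(ppvs - target))]

rng = np.random.default_rng(1)
n = 4000
group = rng.choice(["a", "b"], size=n)
scores = np.clip(rng.beta(2, 2, size=n) + (group == "b") * 0.1, 0, 1)   # group b tends to score higher
y_true = (rng.uniform(size=n) < scores).astype(int)

target_ppv = 0.75
for g in ("a", "b"):
    m = group == g
    t = threshold_for_target_ppv(scores[m], y_true[m], target_ppv)
    print(f"group {g}: threshold {t:.2f} -> PPV {ppv_at_threshold(scores[m], y_true[m], t):.2f}")
```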
| Method type | When applied | Requires retraining | Model-agnostic | Typical tradeoff |
|---|---|---|---|---|
| Pre-processing | Before training | Yes | Yes | May reduce overall accuracy |
| In-processing | During training | Yes | No | Added complexity in optimization |
| Post-processing | After training | No | Yes | May sacrifice other fairness metrics |
The most fundamental challenge for predictive parity is that base rates (the prevalence of the positive outcome) frequently differ across demographic groups. As Chouldechova's theorem demonstrates, when base rates differ, achieving predictive parity forces unequal error rates. This means that one group will experience a higher false positive rate or a higher false negative rate than another, even when PPV is held constant.
Predictive parity relies on accurate labels in the training and evaluation data. If the data reflects historical biases (for example, if arrest records reflect biased policing patterns rather than true criminal behavior), then the "ground truth" labels themselves are corrupted. Achieving predictive parity with respect to biased labels does not guarantee fairness with respect to the true underlying outcomes.
The outcome test (predictive parity) can be misleading when applied to aggregate statistics rather than marginal decisions. Two groups can have equal PPV overall while the classifier still discriminates at the margin. This problem, known as infra-marginality, means that predictive parity at the aggregate level does not necessarily imply fairness in individual cases.
Predictive parity is a group-level fairness criterion. It does not guarantee that similar individuals from different groups receive similar predictions. A classifier can satisfy predictive parity while treating specific individuals unfairly, as long as the group-level PPV statistics are balanced. Individual fairness, which requires that similar individuals receive similar predictions, addresses a complementary concern.
Predictive parity focuses only on positive predictions (Ŷ = 1). It does not constrain what happens among negative predictions. A classifier could have equal PPV but very different negative predictive values (NPV) across groups, meaning that negative predictions are more reliable for one group than another.
Imagine a teacher who gives gold stars to students she thinks did well on a test. Predictive parity means that when she gives a gold star, she is right about the same percentage of the time for boys and for girls. If she gives gold stars to 10 boys and 8 of them actually did well (80% correct), then when she gives gold stars to 10 girls, about 8 of them should have actually done well too (also 80% correct).
If the gold stars are right 80% of the time for boys but only 60% of the time for girls, that is a problem: the gold star means something different depending on whether you are a boy or a girl. Predictive parity says the gold star should be worth the same for everyone.
The tricky part is that making the gold stars equally accurate for boys and girls might mean the teacher makes more mistakes in other ways. For example, she might miss more girls who actually did well (give them no star when they deserve one). That is the tradeoff that mathematicians have proven is unavoidable when the two groups have different average scores.