Predictive parity is a fairness metric in machine learning that requires an algorithm's positive predictive value (PPV), also known as precision, to be equal across all demographic groups defined by a sensitive attribute. When a classifier satisfies predictive parity, the probability that a positive prediction is correct does not depend on the group to which the individual belongs. The concept is also referred to as the outcome test in the academic literature.
Predictive parity belongs to the broader sufficiency family of fairness criteria, which require that the outcome variable Y is conditionally independent of the protected attribute A given the prediction. It stands in contrast to separation-based criteria such as equalized odds and independence-based criteria such as demographic parity. The tensions between these families of fairness definitions have been the subject of several impossibility theorems that carry significant consequences for real-world algorithmic decision-making.
Let Y denote the true binary outcome (1 = positive, 0 = negative), let A denote a sensitive (protected) attribute with groups a and b (for example, race or gender), and let Ŷ denote the classifier's predicted outcome.
A classifier satisfies predictive parity if and only if:
P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b)
In words, the positive predictive value (PPV) must be the same for both groups. Since PPV = TP / (TP + FP), this is equivalent to requiring equal precision across groups.
A classifier satisfying equal PPV will also have equal false discovery rates (FDR = 1 - PPV) across groups:
P(Y = 0 | Ŷ = 1, A = a) = P(Y = 0 | Ŷ = 1, A = b)
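The definition can be checked directly from labeled predictions. The following is a minimal sketch in Python with NumPy; the function name and the toy arrays are illustrative, not part of any standard library.

```python
import numpy as np

def group_ppv(y_true, y_pred, group_mask):
    """Positive predictive value P(Y=1 | Yhat=1) within one group."""
    mask = group_mask & (y_pred == 1)     # positive predictions in this group
    if mask.sum() == 0:
        return float("nan")               # no positive predictions for this group
    return y_true[mask].mean()            # fraction of those predictions that are correct

# Toy data: y_true = actual outcomes, y_pred = classifier output, a = sensitive attribute
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

for g in ("a", "b"):
    ppv = group_ppv(y_true, y_pred, a == g)
    print(f"group {g}: PPV = {ppv:.2f}, FDR = {1 - ppv:.2f}")
```

In this toy example both groups have a PPV of about 0.67, so predictive parity holds exactly; in practice the group-wise values would be compared up to a tolerance or tested statistically.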
Predictive parity is a special case of the sufficiency criterion. Full sufficiency requires:
P(Y = y | Ŷ = ŷ, A = a) = P(Y = y | Ŷ = ŷ, A = b) for all y, ŷ
This means the sensitive attribute provides no additional information about the true outcome once the prediction is known. Predictive parity only enforces this condition for the positive prediction (Ŷ = 1). A stronger version, conditional use accuracy equality, requires equal PPV and equal negative predictive value (NPV) simultaneously.
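Extending the same kind of check to negative predictions gives conditional use accuracy equality. The sketch below, using the same illustrative toy data and a hypothetical helper function, computes both PPV and NPV per group.

```python
import numpy as np

def ppv_and_npv(y_true, y_pred):
    """Return (PPV, NPV) for one group's labels and predictions."""
    pos = y_pred == 1
    neg = y_pred == 0
    ppv = y_true[pos].mean() if pos.any() else float("nan")        # P(Y=1 | Yhat=1)
    npv = (1 - y_true[neg]).mean() if neg.any() else float("nan")  # P(Y=0 | Yhat=0)
    return ppv, npv

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

for g in ("a", "b"):
    ppv, npv = ppv_and_npv(y_true[a == g], y_pred[a == g])
    print(f"group {g}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these toy arrays the two groups share the same PPV but differ in NPV, which satisfies predictive parity while violating the stronger conditional use accuracy equality.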
Predictive parity is closely related to calibration (also called well-calibration or test fairness). A risk score S is calibrated across groups if:
P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b) = s for all scores s
Calibration is a stronger requirement than predictive parity. A calibrated classifier automatically satisfies predictive parity, but a classifier satisfying predictive parity is not necessarily calibrated. Calibration requires the predicted probability to match the true probability at every score level, whereas predictive parity only requires that the overall fraction of correct positive predictions is equal across groups.
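Calibration across groups is typically assessed by binning the scores and comparing, within each bin and group, the mean predicted score to the observed rate of positive outcomes. The following is an illustrative sketch using synthetic scores that are calibrated by construction; the bin edges and function name are arbitrary choices.

```python
import numpy as np

def calibration_by_group(scores, y_true, group, bins=np.linspace(0.0, 1.0, 6)):
    """For each group and score bin, compare mean predicted score to observed positive rate."""
    rows = []
    bin_ids = np.digitize(scores, bins[1:-1])      # assign each score to a bin
    for g in np.unique(group):
        for b in range(len(bins) - 1):
            m = (group == g) & (bin_ids == b)
            if m.sum() == 0:
                continue
            rows.append((g, b, scores[m].mean(), y_true[m].mean(), int(m.sum())))
    return rows  # (group, bin, mean score, observed rate, count)

rng = np.random.default_rng(0)
n = 5000
group = rng.choice(["a", "b"], size=n)
scores = rng.uniform(0, 1, size=n)
y_true = (rng.uniform(0, 1, size=n) < scores).astype(int)  # outcomes drawn so scores are calibrated

for g, b, s, rate, count in calibration_by_group(scores, y_true, group):
    print(f"group {g}, bin {b}: mean score {s:.2f} vs observed rate {rate:.2f} (n={count})")
```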
Predictive parity is computed from specific entries in the confusion matrix. The following table shows how predictive parity relates to the confusion matrix:
| Component | Definition | Role in predictive parity |
|---|---|---|
| True positives (TP) | Correctly predicted positives | Numerator of PPV (also part of the denominator) |
| False positives (FP) | Incorrectly predicted positives | Part of the denominator of PPV |
| True negatives (TN) | Correctly predicted negatives | Not directly involved |
| False negatives (FN) | Incorrectly predicted negatives | Not directly involved |
| PPV (precision) | TP / (TP + FP) | Must be equal across groups |
| FDR | FP / (TP + FP) = 1 - PPV | Equal when PPV is equal |
Predictive parity is one of several group fairness metrics. Each metric enforces a different statistical constraint, and they reflect different intuitions about what it means for an algorithm to be fair.
| Fairness metric | Formal condition | Fairness family | Intuition |
|---|---|---|---|
| Predictive parity | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | Sufficiency | Positive predictions are equally trustworthy across groups |
| Equalized odds | P(Ŷ=1 \| Y=y, A=a) = P(Ŷ=1 \| Y=y, A=b) for y in {0,1} | Separation | Equal true positive rates and equal false positive rates across groups |
| Equal opportunity | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | Separation | Equal true positive rates across groups |
| Demographic parity | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | Independence | Equal selection rates across groups |
| Calibration | P(Y=1 \| S=s, A=a) = P(Y=1 \| S=s, A=b) = s | Sufficiency | Predicted probabilities match true probabilities for all groups |
| Counterfactual fairness | P(Ŷ_A=a \| X) = P(Ŷ_A=b \| X) | Individual / Causal | Prediction would remain the same if the individual's group membership were changed |
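Apart from calibration and counterfactual fairness, the group-level criteria in the table can all be read off from per-group selection rates, true positive rates, false positive rates, and PPV. A compact, illustrative sketch (the function name and toy data are not from any standard library):

```python
import numpy as np

def group_fairness_report(y_true, y_pred, group):
    """Per-group selection rate, TPR, FPR, and PPV for comparing the criteria in the table."""
    report = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        report[g] = {
            "selection_rate": p.mean(),                                   # demographic parity
            "TPR": p[t == 1].mean() if (t == 1).any() else float("nan"),  # equal opportunity
            "FPR": p[t == 0].mean() if (t == 0).any() else float("nan"),  # equalized odds (with TPR)
            "PPV": t[p == 1].mean() if (p == 1).any() else float("nan"),  # predictive parity
        }
    return report

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0])
a      = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
for g, metrics in group_fairness_report(y_true, y_pred, a).items():
    print(g, {k: round(float(v), 2) for k, v in metrics.items()})
```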
Several foundational results in algorithmic fairness prove that predictive parity cannot, in general, be satisfied simultaneously with other fairness criteria. These results are sometimes collectively called the "impossibility of fairness" theorems.
Alexandra Chouldechova proved that when the base rates (prevalence of the positive outcome) differ between two groups, it is mathematically impossible for any imperfect classifier to simultaneously satisfy both predictive parity and classification parity (equal false positive rates and equal false negative rates); only a perfect predictor escapes the conflict.
The proof rests on the following identity derived from Bayes' theorem:
PPV / (1 - PPV) = (p / (1 - p)) * ((1 - FNR) / FPR)
where p is the base rate, FNR is the false negative rate, and FPR is the false positive rate.
If PPV is held equal across two groups (predictive parity) and the base rates p differ, then the ratio (1 - FNR) / FPR must also differ. This means the error rates cannot be equal. Conversely, if FPR and FNR are equalized across groups (classification parity), then PPV must differ whenever base rates differ.
This result has a direct practical implication: in any domain where the prevalence of the outcome varies across demographic groups (which is common in criminal justice, healthcare, and lending), a designer must choose between calibrating the predictions equally (predictive parity) or distributing errors equally (classification parity). Both cannot be achieved at once.
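The identity can be illustrated numerically. The sketch below fixes a common PPV for two groups with different base rates and computes the ratio (1 - FNR) / FPR that the identity then forces on each group; because the implied ratios differ, the two groups cannot share both error rates. The specific PPV and base rates are arbitrary choices.

```python
# Numerical illustration of PPV/(1-PPV) = (p/(1-p)) * ((1-FNR)/FPR).
def implied_error_ratio(ppv, base_rate):
    """Return the (1 - FNR) / FPR ratio implied by a given PPV and base rate."""
    odds_ppv = ppv / (1 - ppv)
    odds_p = base_rate / (1 - base_rate)
    return odds_ppv / odds_p

ppv = 0.6                      # same PPV enforced in both groups (predictive parity)
for group, p in (("a", 0.5), ("b", 0.3)):
    print(f"group {group}: base rate {p:.1f} -> (1-FNR)/FPR = {implied_error_ratio(ppv, p):.2f}")
```

With these numbers the ratio is 1.5 for group a and 3.5 for group b, so equal false positive and false negative rates are impossible while PPV stays at 0.6 in both groups.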
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a related but distinct impossibility result. They defined three fairness conditions for risk scores: calibration within groups (within each group, individuals assigned a score of s are positive with probability s), balance for the positive class (the average score of individuals who truly belong to the positive class is the same across groups), and balance for the negative class (the average score of individuals who truly belong to the negative class is the same across groups).
Their theorem shows that except in highly constrained special cases, no risk score can satisfy all three conditions simultaneously. The two special cases are: (a) perfect prediction, where the classifier makes no errors, and (b) equal base rates across groups. In all other realistic scenarios, at least one condition must be violated.
Since calibration within groups implies predictive parity, this theorem further constrains the feasibility of achieving predictive parity alongside error-rate-based fairness.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Weinberger investigated the tension between calibration and error rate balance more closely. They showed that calibration is compatible with at most one error constraint (specifically, equal false negative rates across groups) and that any algorithm satisfying this relaxation is no better than randomizing a fraction of predictions from an existing classifier.
The most prominent real-world illustration of the predictive parity tradeoff is the controversy over COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a risk assessment tool used in the U.S. criminal justice system to predict recidivism.
In May 2016, the investigative journalism organization ProPublica published an article titled "Machine Bias" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Their analysis examined COMPAS risk scores for over 10,000 criminal defendants in Broward County, Florida, and found that Black defendants were roughly twice as likely as white defendants to be falsely labeled as high risk for recidivism (a higher false positive rate), while white defendants who went on to reoffend were more likely to have been labeled as low risk (a higher false negative rate).
Northpointe (the company that developed COMPAS, now called Equivant) responded in July 2016 with a report authored by William Dieterich, Christina Mendoza, and Tim Brennan titled "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." They argued that the algorithm was fair because it satisfied predictive parity: among defendants classified as high risk, roughly the same proportion of Black and white defendants actually went on to reoffend. In other words, a COMPAS score of 7 meant approximately the same probability of recidivism regardless of race.
The two sides were measuring fairness by different metrics:
| Stakeholder | Fairness metric used | What it requires | Finding |
|---|---|---|---|
| ProPublica | Error rate balance (classification parity) | Equal FPR and FNR across racial groups | COMPAS fails: Black defendants have higher FPR |
| Northpointe | Predictive parity (calibration) | Equal PPV across racial groups | COMPAS passes: PPV is roughly equal across groups |
Because the base rate of recidivism differed between Black and white defendants in the Broward County data, Chouldechova's impossibility theorem explains why COMPAS could not satisfy both criteria simultaneously. The disagreement was not merely a matter of analytic error by either party; it reflected a genuine mathematical constraint.
The COMPAS debate had a significant impact on the field of algorithmic fairness. It prompted the development of formal impossibility theorems, spurred new research on post-processing methods for bias mitigation, and raised public awareness about the normative choices embedded in algorithmic systems. Dressel and Farid (2018) further showed that COMPAS was no more accurate than predictions made by untrained people recruited from Amazon Mechanical Turk, and that equivalent accuracy could be achieved with a simple linear classifier using only two features.
Predictive parity is relevant in any domain where an algorithm makes binary or risk-based predictions that affect people differently across demographic groups.
Risk assessment instruments like COMPAS, PSA (Public Safety Assessment), and ORAS (Ohio Risk Assessment System) assign risk scores to defendants for pretrial release, sentencing, and parole decisions. Predictive parity requires that a given risk score carries the same meaning regardless of the defendant's race, gender, or other protected characteristics.
Clinical prediction models are used to estimate disease risk, allocate medical resources, and prioritize patients. Predictive parity ensures that when a model predicts a patient is at high risk for a condition, that prediction is equally reliable for patients of different races, ethnicities, ages, or genders. Violations can lead to systematic over-treatment or under-treatment of certain populations.
Credit scoring models predict the likelihood that a borrower will repay a loan. Predictive parity in this context means that among all applicants approved for a loan (positive prediction), the fraction who actually repay should be the same across demographic groups. This ensures that the model's approvals carry equal financial meaning regardless of the applicant's background.
Automated resume screening and admissions algorithms predict candidate success. Predictive parity requires that a positive prediction (e.g., "this candidate will succeed") is equally accurate across demographic groups, so that no group is disproportionately subject to incorrect positive predictions.
Several techniques can be used to move a classifier toward predictive parity, though each involves tradeoffs with other fairness criteria and overall accuracy.
Pre-processing approaches modify the training data before the model is built. Techniques include resampling to balance class distributions within each group, reweighting training examples, and transforming features to remove correlations with the protected attribute. These methods aim to produce data from which a standard classifier will naturally produce predictions with equal PPV across groups.
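As an illustration of the reweighting idea, the sketch below computes instance weights in the spirit of the Kamiran and Calders reweighing scheme, which upweights (group, label) combinations that are under-represented relative to statistical independence. It is a generic pre-processing step rather than a method tailored specifically to PPV, and the function name is illustrative.

```python
import numpy as np

def reweighing_weights(y, a):
    """Instance weights that make Y statistically independent of A in the weighted data."""
    w = np.empty(len(y))
    for g in np.unique(a):
        for label in np.unique(y):
            mask = (a == g) & (y == label)
            expected = (a == g).mean() * (y == label).mean()   # P(A=g) * P(Y=label)
            observed = mask.mean()                             # P(A=g, Y=label)
            w[mask] = expected / observed if observed > 0 else 0.0
    return w

y = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
a = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
print(np.round(reweighing_weights(y, a), 2))  # under-represented combinations get weights above 1
```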
In-processing approaches modify the learning algorithm itself. Constraints can be added to the optimization objective that penalize differences in PPV across groups during training. Regularization terms can also be designed to encourage sufficiency-based fairness. These methods tend to be model-specific and require access to the training pipeline.
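A minimal in-processing sketch might add a differentiable surrogate for the PPV gap to a logistic regression objective, as below. The soft PPV expression, the penalty weight, and the synthetic data are illustrative assumptions, not a reference to any particular published method or library.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_ppv_penalty(w, X, y, a, lam=5.0):
    """Logistic loss plus a squared penalty on a soft PPV gap between groups."""
    p = sigmoid(X @ w)
    log_loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # Soft PPV per group: sum(p * y) / sum(p), a differentiable surrogate for precision.
    soft_ppv = [np.sum(p[a == g] * y[a == g]) / (np.sum(p[a == g]) + 1e-9)
                for g in ("a", "b")]
    return log_loss + lam * (soft_ppv[0] - soft_ppv[1]) ** 2

rng = np.random.default_rng(0)
n = 1000
a = rng.choice(["a", "b"], size=n)
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])  # two features + bias
logits = X @ np.array([1.5, -1.0, 0.2]) + (a == "b") * 0.8                 # group-dependent outcome
y = (rng.uniform(size=n) < sigmoid(logits)).astype(float)

result = minimize(loss_with_ppv_penalty, x0=np.zeros(3), args=(X, y, a))
print("fitted weights:", np.round(result.x, 2))
```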
Post-processing approaches adjust the outputs of an already-trained model. The most common technique is group-specific threshold adjustment: rather than using a single classification threshold for all groups, different thresholds are chosen for each group so that the resulting PPV is equalized. This approach is model-agnostic and does not require retraining, making it practical for organizations using off-the-shelf models.
A 2022 study by Dwork et al. proposed a model-agnostic post-processing transformation function specifically designed to achieve predictive rate parity with minimal impact on overall model performance.
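A simple version of group-specific threshold adjustment can be sketched as a grid search: for each group, choose the threshold whose resulting PPV is closest to a common target. The target PPV, the threshold grid, and the synthetic scores below are illustrative; a production system would also weigh the impact on accuracy and on other error rates.

```python
import numpy as np

def ppv_at_threshold(scores, y_true, threshold):
    """PPV of the binary predictions obtained by thresholding the scores."""
    pred = scores >= threshold
    return y_true[pred].mean() if pred.any() else float("nan")

def threshold_for_target_ppv(scores, y_true, target, grid=np.linspace(0.05, 0.95, 181)):
    """Pick the group-specific threshold whose PPV is closest to the target."""
    ppvs = np.array([ppv_at_threshold(scores, y_true, t) for t in grid])
    return grid[np.nanargmin(np.abs(ppvs - target))]

rng = np.random.default_rng(1)
n = 4000
group = rng.choice(["a", "b"], size=n)
scores = np.clip(rng.beta(2, 2, size=n) + (group == "b") * 0.1, 0, 1)   # group b tends to score higher
y_true = (rng.uniform(size=n) < scores).astype(int)

target_ppv = 0.75
for g in ("a", "b"):
    m = group == g
    t = threshold_for_target_ppv(scores[m], y_true[m], target_ppv)
    print(f"group {g}: threshold {t:.2f} -> PPV {ppv_at_threshold(scores[m], y_true[m], t):.2f}")
```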
| Method type | When applied | Requires retraining | Model-agnostic | Typical tradeoff |
|---|---|---|---|---|
| Pre-processing | Before training | Yes | Yes | May reduce overall accuracy |
| In-processing | During training | Yes | No | Added complexity in optimization |
| Post-processing | After training | No | Yes | May sacrifice other fairness metrics |
The most fundamental challenge for predictive parity is that base rates (the prevalence of the positive outcome) frequently differ across demographic groups. As Chouldechova's theorem demonstrates, when base rates differ, achieving predictive parity forces unequal error rates. This means that one group will experience a higher false positive rate or a higher false negative rate than another, even when PPV is held constant.
Predictive parity relies on accurate labels in the training and evaluation data. If the data reflects historical biases (for example, if arrest records reflect biased policing patterns rather than true criminal behavior), then the "ground truth" labels themselves are corrupted. Achieving predictive parity with respect to biased labels does not guarantee fairness with respect to the true underlying outcomes.
The outcome test (predictive parity) can be misleading when applied to aggregate statistics rather than marginal decisions. Two groups can have equal PPV overall while the classifier still discriminates at the margin. This problem, known as infra-marginality, means that predictive parity at the aggregate level does not necessarily imply fairness in individual cases.
Predictive parity is a group-level fairness criterion. It does not guarantee that similar individuals from different groups receive similar predictions. A classifier can satisfy predictive parity while treating specific individuals unfairly, as long as the group-level PPV statistics are balanced. Individual fairness, which requires that similar individuals receive similar predictions, addresses a complementary concern.
Predictive parity focuses only on positive predictions (Ŷ = 1). It does not constrain what happens among negative predictions. A classifier could have equal PPV but very different negative predictive values (NPV) across groups, meaning that negative predictions are more reliable for one group than another.
Imagine a teacher who gives gold stars to students she thinks did well on a test. Predictive parity means that when she gives a gold star, she is right about the same percentage of the time for boys and for girls. If she gives gold stars to 10 boys and 8 of them actually did well (80% correct), then when she gives gold stars to 10 girls, about 8 of them should have actually done well too (also 80% correct).
If the gold stars are right 80% of the time for boys but only 60% of the time for girls, that is a problem: the gold star means something different depending on whether you are a boy or a girl. Predictive parity says the gold star should be worth the same for everyone.
The tricky part is that making the gold stars equally accurate for boys and girls might mean the teacher makes more mistakes in other ways. For example, she might miss more girls who actually did well (give them no star when they deserve one). That is the tradeoff that mathematicians have proven is unavoidable when the two groups have different average scores.