Predictive rate parity

Predictive rate parity (PRP), also called predictive parity or predictive value parity, is a group fairness metric in machine learning that requires a classifier's positive predictive value (PPV), also known as precision, to be equal across all demographic groups defined by a sensitive attribute such as race, gender, or age. When a binary classifier satisfies predictive rate parity, the probability that a positive prediction corresponds to a true positive outcome does not depend on which protected group the individual belongs to. The criterion is a special case of sufficiency, and it is sometimes discussed under the names outcome test or test fairness in the academic literature on algorithmic fairness.

Predictive rate parity sits inside the broader sufficiency family of fairness criteria, which require that the true outcome variable Y is conditionally independent of the protected attribute A given the classifier's prediction. Sufficiency stands in contrast to separation based criteria such as equalized odds and equal opportunity, and to independence based criteria such as demographic parity. The tensions between these three families of fairness definitions are formalized in several impossibility theorems published in 2016 and 2017.

The concept rose to prominence after the 2016 ProPublica investigation of the COMPAS recidivism risk assessment tool, where the developer Northpointe defended its model on the grounds of predictive rate parity while ProPublica criticized it on the grounds of unequal error rates. Alexandra Chouldechova's 2017 paper formally proved that this disagreement reflected an unavoidable mathematical constraint: when the base rates of an outcome differ between two groups, no imperfect classifier can achieve both predictive rate parity and equal false positive and false negative rates at the same time.

Formal definition

Let Y denote the true binary outcome (1 = positive, 0 = negative), let A denote a sensitive attribute with groups a and b (for example, race or gender), and let Ŷ denote the classifier's predicted outcome.

A classifier satisfies predictive rate parity if and only if:

P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b)

In words, the positive predictive value (PPV) must be the same for every group. Because PPV = TP / (TP + FP), this is equivalent to requiring equal precision across groups. A classifier that satisfies equal PPV will also satisfy equal false discovery rates (FDR = 1 - PPV) across groups:

P(Y = 0 | Ŷ = 1, A = a) = P(Y = 0 | Ŷ = 1, A = b)

A stronger version of the criterion, called conditional use accuracy equality, requires equal PPV and equal negative predictive value (NPV) at the same time. The strongest version is calibration, which requires the predicted probability to match the empirical positive rate at every score level for every group.
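The definition can be checked empirically from labels, predictions, and group membership. The sketch below is a minimal illustration rather than a reference implementation; the function name `predictive_values_by_group` and the toy data are hypothetical. It returns both PPV and NPV per group, so it covers predictive rate parity as well as conditional use accuracy equality.

```python
from collections import defaultdict


def predictive_values_by_group(y_true, y_pred, groups):
    """Return {group: (PPV, NPV)}; a value is None when its denominator is zero."""
    # counts[g] = [TP, FP, TN, FN]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for y, yhat, g in zip(y_true, y_pred, groups):
        if yhat == 1:
            counts[g][0 if y == 1 else 1] += 1
        else:
            counts[g][2 if y == 0 else 3] += 1

    result = {}
    for g, (tp, fp, tn, fn) in counts.items():
        ppv = tp / (tp + fp) if tp + fp else None
        npv = tn / (tn + fn) if tn + fn else None
        result[g] = (ppv, npv)
    return result


# Toy data (hypothetical): six individuals in group "a", six in group "b".
y_true = [1, 1, 0, 1, 0, 0] + [1, 0, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0] + [1, 1, 1, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

values = predictive_values_by_group(y_true, y_pred, groups)
ppv_gap = abs(values["a"][0] - values["b"][0])
npv_gap = abs(values["a"][1] - values["b"][1])
print(values, ppv_gap, npv_gap)
```

In this toy data both groups have a PPV of 2/3, so predictive rate parity holds, while NPV is 2/3 in group "a" and 1/3 in group "b", so conditional use accuracy equality does not.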

Connection to positive predictive value

Predictive rate parity is fundamentally a constraint on PPV. PPV answers the question: "Given that the model predicted positive, what is the probability the true label is positive?" It is a measure of the trustworthiness of positive predictions and is a familiar quantity outside fairness research, for example in clinical diagnostics where it is reported alongside sensitivity and specificity.

The table below shows how predictive rate parity relates to the standard entries of the confusion matrix:

Component	Definition	Role in predictive rate parity
True positives (TP)	Correctly predicted positives	Numerator of PPV
False positives (FP)	Incorrectly predicted positives	Denominator term of PPV
True negatives (TN)	Correctly predicted negatives	Not directly involved
False negatives (FN)	Incorrectly predicted negatives	Not directly involved
PPV (precision)	TP / (TP + FP)	Must be equal across groups
FDR	FP / (TP + FP) = 1 - PPV	Equal when PPV is equal
Selection rate	(TP + FP) / N	Used by demographic parity, not PRP
TPR (recall)	TP / (TP + FN)	Used by equal opportunity, not PRP
FPR	FP / (FP + TN)	Used by equalized odds, not PRP

Connection to sufficiency and calibration

Predictive rate parity is a special case of the sufficiency criterion, which requires that for all values of ŷ and y:

P(Y = y | Ŷ = ŷ, A = a) = P(Y = y | Ŷ = ŷ, A = b)

Sufficiency states that the sensitive attribute provides no additional information about the true outcome once the prediction is known. Predictive rate parity enforces this condition only at Ŷ = 1 (positive predictions). Conditional use accuracy equality enforces it at both Ŷ = 1 and Ŷ = 0.

Predictive rate parity is also closely related to calibration within groups, sometimes called well calibration or test fairness. A risk score S is calibrated within groups if for every score s:

P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b) = s

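Calibration is a more fine grained requirement than predictive rate parity: it ties the predicted probability to the true probability at every score level, whereas PRP only equates the aggregate fraction of correct positive predictions across groups. A classifier satisfying PRP therefore need not be calibrated. In the other direction, a score that is calibrated within groups satisfies sufficiency when the score itself is treated as the prediction, but once the score is thresholded the PPV of each group equals that group's average score above the threshold, so equal PPV follows only when those averages coincide.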
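Calibration within groups can be inspected by binning scores and comparing the mean score in each bin with the observed positive rate, separately per group. The following is a minimal sketch under simple assumptions (equal width bins, scores in [0, 1]); the function name `calibration_table` and the toy data are hypothetical.

```python
from collections import defaultdict


def calibration_table(scores, y_true, groups, n_bins=4):
    """Return {(group, bin): (mean score, observed positive rate, count)}."""
    cells = defaultdict(lambda: [0.0, 0, 0])  # [score sum, positives, count]
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        cell = cells[(g, b)]
        cell[0] += s
        cell[1] += y
        cell[2] += 1
    return {
        key: (score_sum / n, positives / n, n)
        for key, (score_sum, positives, n) in cells.items()
    }


# Toy data (hypothetical): identical scores in both groups, different outcomes.
scores = ([0.25] * 4 + [0.75] * 4) * 2
y_true = [1, 0, 0, 0, 1, 1, 1, 0] + [1, 0, 0, 0, 1, 1, 0, 0]
groups = ["a"] * 8 + ["b"] * 8

table = calibration_table(scores, y_true, groups)
for (g, b), (mean_score, observed, n) in sorted(table.items()):
    print(g, b, f"mean score={mean_score:.2f} observed rate={observed:.2f} n={n}")
```

In this toy data group "a" is calibrated in both occupied bins, while in group "b" a score of 0.75 corresponds to an observed rate of only 0.50, so the score is not calibrated within groups.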

History

The formal study of predictive rate parity emerged from the practical controversy over recidivism risk scores in the United States criminal justice system, then matured into a theoretical research program through impossibility results published in 2016 and 2017.

Predictive value parity in early statistics

Long before machine learning existed as a field, statisticians and epidemiologists used PPV to describe the reliability of diagnostic and screening tests. The 1980s and 1990s literature on differential prediction in psychometrics and economics already discussed the problem of test bias as a question about whether predicted outcomes carry the same meaning across subgroups, a notion close to what would later be formalized as predictive rate parity. The terminology of "test fairness" used in the COMPAS debate descends from this older psychometric tradition.

The ProPublica COMPAS investigation (2016)

In May 2016, the investigative journalism organization ProPublica published an article titled "Machine Bias" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. The article analyzed COMPAS risk scores assigned to over 10,000 criminal defendants in Broward County, Florida, between 2013 and 2014. ProPublica found that Black defendants were nearly twice as likely as white defendants to be incorrectly labeled as high risk for recidivism (a higher false positive rate), while white defendants who later reoffended were more likely to have been incorrectly labeled low risk (a higher false negative rate).

Northpointe's response (2016)

In July 2016, Northpointe (later renamed equivant) published a rebuttal report by William Dieterich, Christina Mendoza, and Tim Brennan titled "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." The report argued that COMPAS satisfied predictive rate parity: a high risk score corresponded to roughly the same probability of recidivism for Black and white defendants, so the score had the same meaning in both groups. From Northpointe's perspective, equal PPV across racial groups was the appropriate fairness standard.

Chouldechova's impossibility result (2017)

Alexandra Chouldechova's paper "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," published in Big Data in 2017, formalized the disagreement. Chouldechova proved that when base rates differ between groups, predictive rate parity and classification parity (equal FPR and equal FNR) cannot both hold. Her proof rests on an identity derived from Bayes' theorem:

FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)

where p is the base rate of the positive outcome. If PPV is equalized across groups but the base rate p differs, then FPR and FNR cannot both be equal across groups. The 2017 paper made it clear that the COMPAS controversy reflected a genuine mathematical impossibility, not analytic error by either ProPublica or Northpointe.
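The identity follows from writing PPV in terms of the base rate and the two error rates. A short derivation, using the notation above and the fact that the true positive rate is 1 - FNR:

```latex
\begin{aligned}
\mathrm{PPV}
  &= \frac{P(\hat{Y}=1,\, Y=1)}{P(\hat{Y}=1)}
   = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}} \\[4pt]
\frac{1-\mathrm{PPV}}{\mathrm{PPV}}
  &= \frac{(1-p)\,\mathrm{FPR}}{p\,(1-\mathrm{FNR})} \\[4pt]
\mathrm{FPR}
  &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})
\end{aligned}
```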

Kleinberg, Mullainathan, and Raghavan (2017)

In parallel, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan published "Inherent Trade-Offs in the Fair Determination of Risk Scores" at the 8th Innovations in Theoretical Computer Science (ITCS) conference in 2017 (the preprint appeared on arXiv in September 2016). They proved a related but distinct impossibility theorem for risk scores. Their result extended the COMPAS analysis from binary classifiers to continuous risk scores and identified precise conditions under which calibration within groups is incompatible with balanced error rates.

Hardt, Price, and Srebro (2016)

Moritz Hardt, Eric Price, and Nathan Srebro's NeurIPS 2016 paper "Equality of Opportunity in Supervised Learning" proposed equalized odds and equal opportunity as separation based alternatives to demographic parity. Their post processing method assumed the classifier's score was already given, and adjusted thresholds per group to satisfy the new criterion. Together with Chouldechova and Kleinberg et al., this paper established the standard taxonomy of group fairness metrics that PRP belongs to.

Pleiss et al. (2017)

Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Weinberger's NeurIPS 2017 paper "On Fairness and Calibration" sharpened the impossibility analysis. They showed that calibration within groups is compatible with at most one of the two error rate conditions (equal FPR or equal FNR), and that any algorithm that achieves both calibration and a single error rate match is no better than randomizing predictions from an existing classifier with some probability.

Survey and consolidation (2018-2021)

Sahil Verma and Julia Rubin's 2018 paper "Fairness Definitions Explained," presented at the FairWare workshop, catalogued more than twenty fairness definitions and provided unified notation. The 2021 ACM Computing Surveys paper "A Survey on Bias and Fairness in Machine Learning" by Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan extended this to a comprehensive review. The freely available textbook Fairness and Machine Learning: Limitations and Opportunities by Solon Barocas, Moritz Hardt, and Arvind Narayanan (first published online in 2019, with a print edition from MIT Press in 2023) settled on the independence, separation, sufficiency trichotomy that organizes most modern teaching of group fairness.

The independence, separation, sufficiency trichotomy

Group fairness criteria are commonly grouped into three mutually incompatible families. Predictive rate parity belongs to the sufficiency family.

Family	Condition	Example metrics	What it requires
Independence	Ŷ ⊥ A	Demographic parity, statistical parity, disparate impact	Predictions are independent of group membership
Separation	Ŷ ⊥ A \| Y	Equalized odds, equal opportunity, balance for negatives	Predictions are independent of group membership given the true outcome
Sufficiency	Y ⊥ A \| Ŷ	Predictive rate parity, predictive value parity, calibration	True outcome is independent of group membership given the prediction

The Barocas, Hardt, Narayanan textbook proves that any two of these three conditions can only hold simultaneously in degenerate special cases (such as perfect prediction, or equal base rates). Predictive rate parity belongs to the sufficiency family because it constrains the conditional distribution of Y given Ŷ, restricted to the case Ŷ = 1.

Comparison with other fairness metrics

The table below contrasts predictive rate parity with the other most commonly cited group fairness metrics. Each row shows the formal condition, the family it belongs to, and the everyday intuition behind it.

Fairness metric	Formal condition	Family	Intuition
Predictive rate parity	P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b)	Sufficiency	Positive predictions are equally trustworthy across groups
Conditional use accuracy equality	Equal PPV and equal NPV	Sufficiency	Both positive and negative predictions are equally trustworthy
Calibration	P(Y=1 \| S=s, A=a) = s for every s and group	Sufficiency	Predicted probabilities match true probabilities at every score level
Equalized odds	P(Ŷ=1 \| Y=y, A=a) = P(Ŷ=1 \| Y=y, A=b) for y∈{0,1}	Separation	Equal true positive rates and equal false positive rates across groups
Equal opportunity	P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b)	Separation	Equal true positive rates across groups
Predictive equality	P(Ŷ=1 \| Y=0, A=a) = P(Ŷ=1 \| Y=0, A=b)	Separation	Equal false positive rates across groups
Demographic parity	P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b)	Independence	Equal positive prediction rates across groups
Disparate impact (80% rule)	P(Ŷ=1 \| A=a) / P(Ŷ=1 \| A=b) ≥ 0.8	Independence	Selection rates differ by less than the legal threshold
Counterfactual fairness	Prediction is unchanged in a counterfactual world where A is altered	Causal	Group membership has no causal effect on the prediction
Individual fairness	Similar individuals receive similar predictions	Individual	Predictions respect a similarity metric on individuals

The critical observation from this table is that predictive rate parity, equalized odds, and demographic parity each protect a different statistical quantity. They cannot in general all be satisfied at once, so a designer must choose which definition reflects the values relevant to a particular application.

Impossibility theorems

Chouldechova's theorem

Chouldechova (2017) proved that when the base rates p differ between two groups, no binary classifier can simultaneously satisfy predictive rate parity and classification parity (equal FPR and equal FNR). The proof uses the identity:

FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)

If PPV is equal across groups (PRP holds) and the base rates differ, then FPR and FNR cannot both be equal. Conversely, if FPR and FNR are equalized (classification parity), PPV must differ whenever base rates differ.

This result has immediate practical force. In any domain where the prevalence of the outcome varies across demographic groups, which is common in criminal justice, healthcare, credit scoring, and hiring, a designer must choose between equalizing the reliability of positive predictions (predictive rate parity) and balancing the error rates (classification parity). Both cannot be achieved at once except in trivial cases.
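A small numerical check makes the constraint concrete. The sketch below plugs hypothetical numbers into the identity: PPV and FNR are held equal across two groups, and only the base rate differs. The function name `implied_fpr` is illustrative.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by the identity for a given base rate, PPV, and FNR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)


# Hypothetical numbers: same PPV and FNR in both groups, different base rates.
ppv, fnr = 0.8, 0.3
fpr_low = implied_fpr(base_rate=0.3, ppv=ppv, fnr=fnr)
fpr_high = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
print(f"FPR at base rate 0.3: {fpr_low:.3f}")
print(f"FPR at base rate 0.5: {fpr_high:.3f}")
```

With these inputs the implied false positive rates are about 0.075 and 0.175, so the group with the higher base rate must carry the higher false positive rate if PPV and FNR are to match.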

Kleinberg, Mullainathan, and Raghavan's theorem

Kleinberg, Mullainathan, and Raghavan (2017) defined three fairness conditions for risk scores:

Calibration within groups. Among individuals assigned a particular risk score, the fraction who are actually positive is the same across groups, and equal to the score itself.
Balance for the positive class. The average risk score assigned to actually positive individuals is the same across groups.
Balance for the negative class. The average risk score assigned to actually negative individuals is the same across groups.

They proved that no risk score can satisfy all three conditions except in two special cases: (a) perfect prediction, where the classifier makes no errors; and (b) equal base rates across groups. In every other scenario, at least one condition must be violated. Because calibration within groups is the score level counterpart of predictive rate parity, this theorem carries the trade-off identified by Chouldechova over from binary classifiers to continuous risk scores.
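The two balance conditions are simple group averages and can be computed directly. The sketch below is illustrative only; the function name `class_balance` and the toy data are hypothetical. The toy scores are calibrated within each group by construction (a score of 0.25 corresponds to a 25% positive rate and 0.75 to 75%), but the base rates differ, which is the situation the theorem addresses.

```python
from collections import defaultdict


def class_balance(scores, y_true, groups):
    """Return {group: (mean score among positives, mean score among negatives)}."""
    totals = defaultdict(lambda: {1: [0.0, 0], 0: [0.0, 0]})  # label -> [sum, count]
    for s, y, g in zip(scores, y_true, groups):
        totals[g][y][0] += s
        totals[g][y][1] += 1

    out = {}
    for g, by_label in totals.items():
        pos_sum, pos_n = by_label[1]
        neg_sum, neg_n = by_label[0]
        out[g] = (
            pos_sum / pos_n if pos_n else None,
            neg_sum / neg_n if neg_n else None,
        )
    return out


# Group "a": base rate 4/8. Group "b": base rate 5/12. Both calibrated.
scores = [0.25] * 4 + [0.75] * 4 + [0.25] * 8 + [0.75] * 4
y_true = [1, 0, 0, 0, 1, 1, 1, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
groups = ["a"] * 8 + ["b"] * 12

balance = class_balance(scores, y_true, groups)
print(balance)
```

Here the mean score among actual positives is 0.625 in group "a" and 0.55 in group "b", and the mean score among actual negatives is 0.375 versus roughly 0.32, so neither balance condition holds even though the score is calibrated in both groups.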

Pleiss, Raghavan, Wu, Kleinberg, Weinberger (2017)

Pleiss et al. (2017) sharpened the picture for calibrated probability scores. They showed that calibration within groups is compatible with at most one of the two error rate conditions (equal FPR alone or equal FNR alone, not both), and that any algorithm achieving both calibration and one error rate match is no more useful than randomizing predictions from a baseline classifier with some probability. The conclusion is that strict calibration plus error balance can only be enforced by deliberately degrading model performance.

Friedler, Scheidegger, and Venkatasubramanian

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian's 2016 paper "On the (Im)possibility of Fairness" framed the impossibility results as the consequence of two competing worldviews: the What You See Is What You Get worldview, in which observed features fairly represent ability, and the We're All Equal worldview, in which any group differences in observed features are the product of structural bias. PRP and demographic parity reflect these two worldviews and are therefore not reconcilable on purely technical grounds.

Trade-offs

Because of the impossibility theorems, every choice of fairness metric carries trade-offs against other metrics and against accuracy. Adopting predictive rate parity has specific consequences worth spelling out:

Trade-off	Effect of enforcing PRP
Versus equalized odds	When base rates differ, equal PPV forces unequal FPR or unequal FNR
Versus demographic parity	Equal PPV does not constrain selection rates, so groups can be selected at very different rates
Versus accuracy	Adjusting thresholds per group to equalize PPV typically reduces overall accuracy
Versus negative predictions	PRP says nothing about NPV, so negative predictions can still differ in reliability
Versus individual fairness	Equal group level PPV does not protect individuals from being treated unlike similar people in another group
Versus utility	If different costs apply to false positives across groups, equal PPV may not equalize harm

Corbett-Davies and Goel (2018) and Corbett-Davies, Pierson, Feller, Goel, and Huq (2017) argued that predictive rate parity together with the use of a single risk threshold maximizes a particular notion of public safety utility, but does so by accepting unequal error rates whenever base rates differ. The choice of which trade-off to accept is a normative one, not a technical one.

Tools and software

Several open source libraries support the measurement and mitigation of predictive rate parity violations as part of their broader fairness toolkit.

Tool	Maintainer	Language	Predictive rate parity support
Fairlearn	Microsoft	Python	`MetricFrame` reports any metric by group, including the built-in `selection_rate`, `true_positive_rate`, and `false_positive_rate` and scikit-learn's `precision_score` (PPV); its `ThresholdOptimizer` supports post processing for selection rate and error rate parity constraints
AIF360 (AI Fairness 360)	IBM (now LF AI & Data Foundation)	Python and R	Includes `BinaryLabelDatasetMetric` and `ClassificationMetric`, whose `positive_predictive_value` and `false_discovery_rate_difference` methods cover PPV by group, together with reweighing, prejudice remover, and reject option algorithms
What-If Tool	Google	TensorBoard plugin	Visualizes per-group PPV and lets users experiment with per-group thresholds
Aequitas	Center for Data Science and Public Policy, University of Chicago	Python	Computes a fairness audit table including `ppr` (predicted positive rate) and `precision` disparities
Themis-ML	Niels Bantilan	Python	Implements pre processing and in processing techniques relevant to PRP
FairTest	Columbia and EPFL	Python	Statistical investigation framework for unequal predictive value

Fairlearn's MetricFrame API makes it straightforward to compute PPV per group:

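from sklearn.metrics import precision_score
from fairlearn.metrics import MetricFrame

# y_test, y_pred, and A_test are assumed to be aligned arrays of true labels,
# binary predictions, and sensitive attribute values.
frame = MetricFrame(
    metrics={"ppv": precision_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=A_test,
)
print(frame.by_group)  # PPV for each group
# Largest gap in PPV between any two groups
print("PPV difference:", frame.difference())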

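AIF360 reports PPV for each group, and the gap between groups as a difference in false discovery rate (1 - PPV):

from aif360.metrics import ClassificationMetric

# dataset_true and dataset_pred are assumed to be BinaryLabelDataset objects
# holding the true and predicted labels, with "race" as a protected attribute.
cm = ClassificationMetric(
    dataset_true,
    dataset_pred,
    privileged_groups=[{"race": 1}],
    unprivileged_groups=[{"race": 0}],
)
print(cm.positive_predictive_value(privileged=False))
print(cm.positive_predictive_value(privileged=True))
# FDR of the unprivileged group minus FDR of the privileged group
print(cm.false_discovery_rate_difference())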

Methods for achieving predictive rate parity

Methods to bring a classifier toward predictive rate parity follow the standard taxonomy used across the fairness literature.

Method type	When applied	Requires retraining	Model agnostic	How it works
Pre processing	Before training	Yes	Yes	Modifies training data through reweighting, resampling, or representation learning
In processing	During training	Yes	No	Adds a constraint or regularizer on PPV disparity to the loss function
Post processing	After training	No	Yes	Picks a different decision threshold per group so that PPV is equalized

Pre processing

Pre processing approaches reshape the training data so that a standard learner naturally produces predictions with equal PPV. Faisal Kamiran and Toon Calders' 2012 reweighting method assigns example weights based on the joint distribution of the protected attribute and the label. Optimized pre processing by Calmon et al. (NeurIPS 2017) jointly transforms features and labels to minimize a divergence subject to fairness constraints.
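As an illustration of the reweighting idea, the sketch below assigns each example the weight P(A = a) P(Y = y) / P(A = a, Y = y), estimated from counts, which makes the label and the protected attribute statistically independent in the weighted data. This is a minimal sketch of that weighting formula, not the authors' code; the function name `reweighing_weights` and the toy data are hypothetical. The weights act on the training distribution only, so whether the trained model ends up with equal PPV still has to be checked after training.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each example by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy data (hypothetical): positives are common in group "a", rare in group "b".
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0] + [1, 0, 0, 0]
weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    num = sum(w for w, g, y in zip(weights, groups, labels) if g == group and y == 1)
    den = sum(w for w, g in zip(weights, groups) if g == group)
    return num / den


print(weights)
print(weighted_positive_rate("a"), weighted_positive_rate("b"))
```

After weighting, both groups have a weighted positive rate of 0.5, matching the overall positive rate, and the weights sum to the number of examples. The weights can typically be passed to a learner through a sample weight argument.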

In processing

In processing approaches add a fairness penalty or constraint to the optimization objective. Zafar et al. (2017) proposed margin based proxy constraints that limit disparate mistreatment, that is, group differences in misclassification rates such as the false positive and false negative rates. Agarwal et al. (ICML 2018) developed a reductions approach that turns fair classification into a sequence of cost sensitive learning problems and supports any classifier as a black box base learner.

Post processing

The most widely deployed approach is group specific threshold selection. After a classifier is trained, the analyst picks a separate decision threshold for each group so that the resulting PPVs are equal. Group specific thresholds are also the basis of the equalized odds post processing of Hardt, Price, and Srebro (2016). Pleiss et al. (2017) provide a randomized variant that preserves calibration to the extent possible.
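Group specific threshold selection for PPV can be sketched in a few lines. The example below is a simplified illustration, assuming a target PPV is given and that thresholds are chosen from the observed scores; the function names, the `target_ppv` parameter, and the toy data are hypothetical. In practice the thresholds would be chosen on held out data.

```python
def ppv_at_threshold(scores, y_true, threshold):
    """PPV of the rule 'predict positive when score >= threshold'."""
    selected = [y for s, y in zip(scores, y_true) if s >= threshold]
    return sum(selected) / len(selected) if selected else None


def pick_threshold(scores, y_true, target_ppv):
    """Return (threshold, PPV) for the candidate whose PPV is closest to target_ppv.

    Candidates are the distinct observed scores. Ties go to the lower
    threshold, which selects more people.
    """
    best = None
    for t in sorted(set(scores)):
        ppv = ppv_at_threshold(scores, y_true, t)
        if ppv is None:
            continue
        gap = abs(ppv - target_ppv)
        if best is None or gap < best[0]:
            best = (gap, t, ppv)
    return best[1], best[2]


# Toy data (hypothetical) for two groups.
scores_a = [0.1, 0.3, 0.5, 0.7, 0.9]
y_a = [0, 0, 1, 1, 1]
scores_b = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
y_b = [0, 0, 0, 1, 1, 1]

t_a, ppv_a = pick_threshold(scores_a, y_a, target_ppv=0.75)
t_b, ppv_b = pick_threshold(scores_b, y_b, target_ppv=0.75)
print(t_a, ppv_a, t_b, ppv_b)
```

In this toy data the two groups reach a PPV of 0.75 at different thresholds (0.3 for the first group and 0.6 for the second), which is the sense in which PPV is equalized by treating the groups' scores differently.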

A 2022 study by a team led by Cynthia Dwork at Harvard, Microsoft Research, and Stanford proposed a model agnostic post processing transformation function specifically designed to enforce predictive rate parity with minimal impact on overall model performance.

Use cases

Predictive rate parity is meaningful in any domain where an algorithm produces binary decisions or risk scores that affect people from multiple demographic groups.

Criminal justice

Risk assessment instruments such as COMPAS, the Public Safety Assessment (PSA), and the Ohio Risk Assessment System (ORAS) assign risk scores to defendants for pretrial release, sentencing, and parole. Predictive rate parity in this setting requires that a high risk classification carry the same probability of reoffense regardless of the defendant's race or gender. Northpointe's defense of COMPAS rested on this criterion. Jurisdictions that adopt these tools may commission predictive validity studies broken down by group.

Healthcare

Clinical prediction models are used to estimate disease risk, allocate resources such as ICU beds and organ transplants, and prioritize patients in screening programs. Predictive rate parity ensures that when a model flags a patient as high risk for a condition, the prediction is equally reliable for patients of different races, genders, and ages. The 2019 Science paper by Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan analyzed a widely used commercial algorithm that used healthcare cost as a proxy for healthcare need, and found that Black patients had to be much sicker than white patients to receive the same risk score. Restoring predictive rate parity in this case required replacing the cost based label with a direct measure of illness severity.

Credit scoring and lending

Credit scoring models predict the likelihood that a borrower will repay a loan. Predictive rate parity requires that among approved applicants from different demographic groups, the fraction who actually repay is the same. The U.S. Equal Credit Opportunity Act prohibits discrimination on the basis of race, sex, religion, national origin, marital status, age, and receipt of public assistance. PRP is one technical standard a lender can adopt to demonstrate that its approvals carry consistent meaning across groups.

Hiring and admissions

Automated resume screening, video interview scoring, and admissions algorithms predict candidate success. Predictive rate parity requires that a positive prediction ("this candidate will succeed") is equally accurate across demographic groups. Amazon's experimental resume screening tool, which Reuters reported in 2018 had been abandoned after evidence that it disadvantaged women, illustrates the need to audit such systems. New York City's Local Law 144 of 2021, in force since July 2023, requires bias audits of automated employment decision tools that include explicit calculation of selection rates and impact ratios by group, with predictive rate parity often used as a complementary auditing metric.

Online advertising and content moderation

Ad targeting and content moderation systems classify users and content for selection and removal decisions. Predictive rate parity is one of several metrics used in audits of these systems, alongside selection rate disparity (related to demographic parity) and error rate disparity (related to equalized odds).

Limitations

Unequal base rates

The most fundamental limitation of predictive rate parity is the constraint imposed by Chouldechova's theorem. When the base rates differ across groups, equalizing PPV forces unequal error rates. In domains such as recidivism prediction where base rates do differ, choosing PRP means accepting that some groups will have a higher false positive rate than others.

Dependence on label quality

Predictive rate parity treats observed labels as ground truth. If the labels themselves reflect historical bias, for example arrest records that reflect biased policing rather than true criminal behavior, then equalizing PPV with respect to corrupted labels does not produce fairness with respect to the underlying outcomes. Suresh and Guttag (2021) catalogued the sources of label and measurement bias that can undermine any fairness metric that takes labels as given.

Infra-marginality

The outcome test (predictive rate parity) can be misleading when applied to aggregate statistics rather than to marginal decisions. Two groups can have equal PPV overall while the classifier still discriminates at the decision threshold. This problem, known as infra-marginality, was discussed in detail by Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel in their 2017 Annals of Applied Statistics paper on the outcome test in policing. Aggregate PPV equality does not imply fairness in individual cases.

Tension with individual fairness

PRP is a group level criterion. It does not guarantee that similar individuals from different groups receive similar predictions. A classifier can satisfy PRP while still treating specific people unfairly, as long as the aggregate PPV statistics are balanced. Individual fairness, introduced by Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel in 2012, addresses this complementary concern by requiring that the prediction function satisfies a Lipschitz condition with respect to a task specific similarity metric.

Negative predictions are unconstrained

Predictive rate parity restricts only the positive prediction case (Ŷ = 1). A classifier could have equal PPV but very different negative predictive values (NPV) across groups, meaning that negative predictions are more reliable for one group than for another. Conditional use accuracy equality addresses this by requiring equal NPV in addition to equal PPV.

Sensitivity to thresholding

PPV depends on the decision threshold. A model can satisfy PRP at one threshold and violate it at another. Audits should report PPV at multiple operating points or use threshold free measures such as PR AUC alongside the parity metric.

Lack of causal interpretation

Predictive rate parity is purely statistical. It does not consider whether the disparity in predictions arises from a causal effect of the protected attribute or from a legitimate predictor that happens to correlate with it. Counterfactual fairness, introduced by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva at NeurIPS 2017, takes a causal view that PRP cannot represent.

Application risk and gaming

When PRP becomes a regulatory or audit target, organizations have an incentive to optimize for the metric in ways that satisfy the letter but not the spirit. Selecting a non representative validation set, choosing favorable thresholds, or excluding marginal cases can all produce equal aggregate PPV without producing fair outcomes.

Worked example

Consider a simplified loan approval setting with two equally sized groups A and B. Suppose the classifier produces the following confusion matrices.

Quantity	Group A	Group B
True positives (TP)	80	60
False positives (FP)	20	15
Selected (TP + FP)	100	75
PPV (precision)	0.80	0.80
False negatives (FN)	20	40
True negatives (TN)	80	85
Actual positives (TP + FN)	100	100
TPR (recall)	0.80	0.60
Base rate	0.50	0.50

In this example PPV is 0.80 in both groups, so predictive rate parity holds. However, the true positive rate (recall) is 0.80 in group A and 0.60 in group B, so equal opportunity and equalized odds are violated. The selection rate (100/200 vs 75/200) is also different, so demographic parity is violated. This illustrates how PRP can hold while other fairness metrics fail.
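The figures in the table can be reproduced from the four confusion matrix counts. The sketch below is illustrative; the `Confusion` class is a hypothetical helper, and the false positive rate, which the table does not list, is derived from the same counts.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def n(self):
        return self.tp + self.fp + self.fn + self.tn

    @property
    def ppv(self):
        return self.tp / (self.tp + self.fp)

    @property
    def tpr(self):
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self):
        return self.fp / (self.fp + self.tn)

    @property
    def selection_rate(self):
        return (self.tp + self.fp) / self.n

    @property
    def base_rate(self):
        return (self.tp + self.fn) / self.n


# Counts taken from the table above.
group_a = Confusion(tp=80, fp=20, fn=20, tn=80)
group_b = Confusion(tp=60, fp=15, fn=40, tn=85)

for name, g in (("A", group_a), ("B", group_b)):
    print(
        name,
        f"PPV={g.ppv:.2f} TPR={g.tpr:.2f} FPR={g.fpr:.2f} "
        f"selection={g.selection_rate:.3f} base={g.base_rate:.2f}",
    )
```

The derived false positive rates are 0.20 for group A and 0.15 for group B, so both components of equalized odds differ between the groups even though PPV is identical.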

Explain like I'm 5 (ELI5)

Imagine a teacher who gives gold stars to the students she thinks did well on a test. Predictive rate parity means that when she gives a gold star, she is right about the same percentage of the time for boys and for girls. If she gives gold stars to 10 boys and 8 of them really did well (80% correct), then when she gives gold stars to 10 girls, about 8 of them should really have done well too (also 80% correct).

If the gold stars are right 80% of the time for boys but only 60% of the time for girls, that is a problem. The gold star means something different depending on whether you are a boy or a girl. Predictive rate parity says the gold star should be worth the same for everyone.

The tricky part is that making the gold stars equally accurate for boys and girls might mean the teacher makes more mistakes in other ways. For example, she might miss more girls who really did well (and give them no star when they deserve one). That is the trade-off that mathematicians have proven is unavoidable when the share of students who really did well is different for boys and for girls.

References

Chouldechova, A. (2017). "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments." *Big Data*, 5(2), 153-163.
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). "Inherent trade-offs in the fair determination of risk scores." *Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS)*. arXiv:1609.05807.
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine bias." *ProPublica*, May 23, 2016.
Dieterich, W., Mendoza, C., & Brennan, T. (2016). "COMPAS risk scales: Demonstrating accuracy equity and predictive parity." Northpointe Inc.
Hardt, M., Price, E., & Srebro, N. (2016). "Equality of opportunity in supervised learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 29, 3315-3323.
Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). "On fairness and calibration." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
Verma, S., & Rubin, J. (2018). "Fairness definitions explained." *Proceedings of the IEEE/ACM International Workshop on Software Fairness (FairWare)*, 1-7.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). "A survey on bias and fairness in machine learning." *ACM Computing Surveys*, 54(6), 1-35.
Friedler, S. A., Scheidegger, C., & Venkatasubramanian, S. (2016). "On the (im)possibility of fairness." arXiv:1609.07236.
Corbett-Davies, S., & Goel, S. (2018). "The measure and mismeasure of fairness: A critical review of fair machine learning." arXiv:1808.00023.
Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). "Algorithmic decision making and the cost of fairness." *Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 797-806.
Dressel, J., & Farid, H. (2018). "The accuracy, fairness, and limits of predicting recidivism." *Science Advances*, 4(1), eaao5580.
Barocas, S., Hardt, M., & Narayanan, A. (2023). *Fairness and Machine Learning: Limitations and Opportunities*. MIT Press. (online edition first published 2019, fairmlbook.org).
Berk, R., Heidari, H., Jabbari, S., Kearns, M., & Roth, A. (2021). "Fairness in criminal justice risk assessments: The state of the art." *Sociological Methods & Research*, 50(1), 3-44.
Simoiu, C., Corbett-Davies, S., & Goel, S. (2017). "The problem of infra-marginality in outcome tests for discrimination." *Annals of Applied Statistics*, 11(3), 1193-1216.
Kamiran, F., & Calders, T. (2012). "Data preprocessing techniques for classification without discrimination." *Knowledge and Information Systems*, 33(1), 1-33.
Zafar, M. B., Valera, I., Rodriguez, M. G., & Gummadi, K. P. (2017). "Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment." *Proceedings of the 26th International Conference on World Wide Web*, 1171-1180.
Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., & Wallach, H. (2018). "A reductions approach to fair classification." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 60-69.
Calmon, F. P., Wei, D., Vinzamuri, B., Ramamurthy, K. N., & Varshney, K. R. (2017). "Optimized pre-processing for discrimination prevention." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). "Fairness through awareness." *Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS)*, 214-226.
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). "Counterfactual fairness." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). "Dissecting racial bias in an algorithm used to manage the health of populations." *Science*, 366(6464), 447-453.
Suresh, H., & Guttag, J. (2021). "A framework for understanding sources of harm throughout the machine learning life cycle." *Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO '21)*, 1-9.
Bird, S., Dudik, M., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki, M., Wallach, H., & Walker, K. (2020). "Fairlearn: A toolkit for assessing and improving fairness in AI." Microsoft Research Technical Report MSR-TR-2020-32.
Bellamy, R. K. E., et al. (2019). "AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias." *IBM Journal of Research and Development*, 63(4/5), 4:1-4:15.
New York City Department of Consumer and Worker Protection. (2023). "Local Law 144 of 2021: Automated employment decision tools final rules." Effective July 5, 2023.
