Predictive rate parity
Last reviewed
Apr 28, 2026
Sources
26 citations
Review status
Source-backed
Revision
v4 · 5,604 words
Predictive rate parity (PRP), also called predictive parity, predictive value parity, or the sufficiency criterion, is a group fairness metric in machine learning that requires a classifier's positive predictive value (PPV), also known as precision, to be equal across all demographic groups defined by a sensitive attribute such as race, gender, or age. When a binary classifier satisfies predictive rate parity, the probability that a positive prediction corresponds to a true positive outcome does not depend on which protected group the individual belongs to. The metric is sometimes called the outcome test or test fairness in the academic literature on algorithmic fairness.
Predictive rate parity sits inside the broader sufficiency family of fairness criteria, which require that the true outcome variable Y is conditionally independent of the protected attribute A given the classifier's prediction. Sufficiency stands in contrast to separation-based criteria such as equalized odds and equal opportunity, and to independence-based criteria such as demographic parity. The tensions between these three families of fairness definitions are formalized in several impossibility theorems published in 2016 and 2017 that have shaped the field of algorithmic fairness.
The concept rose to prominence after the 2016 ProPublica investigation of the COMPAS recidivism risk assessment tool, where the developer Northpointe defended its model on the grounds of predictive rate parity while ProPublica criticized it on the grounds of unequal error rates. Alexandra Chouldechova's 2017 paper formally proved that this disagreement reflected an unavoidable mathematical constraint: when the base rates of an outcome differ between two groups, no classifier can achieve both predictive rate parity and equal error rates at the same time.
Let Y denote the true binary outcome (1 = positive, 0 = negative), let A denote a sensitive attribute with groups a and b (for example, race or gender), and let Ŷ denote the classifier's predicted outcome.
A classifier satisfies predictive rate parity if and only if:
P(Y = 1 | Ŷ = 1, A = a) = P(Y = 1 | Ŷ = 1, A = b)
In words, the positive predictive value (PPV) must be the same for every group. Because PPV = TP / (TP + FP), this is equivalent to requiring equal precision across groups. A classifier that satisfies equal PPV will also satisfy equal false discovery rates (FDR = 1 - PPV) across groups:
P(Y = 0 | Ŷ = 1, A = a) = P(Y = 0 | Ŷ = 1, A = b)
A stronger version of the criterion, called conditional use accuracy equality, requires equal PPV and equal negative predictive value (NPV) at the same time. The strongest version is calibration, which requires the predicted probability to match the empirical positive rate at every score level for every group.
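As a minimal sketch of these definitions, the helper below computes PPV and NPV per group from parallel lists of labels, predictions, and group membership. The function name `predictive_values_by_group` and the toy data are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def predictive_values_by_group(y_true, y_pred, groups):
    """Return {group: (ppv, npv)}; a value is None when its denominator is zero."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, yhat, g in zip(y_true, y_pred, groups):
        if yhat == 1:
            counts[g]["tp" if y == 1 else "fp"] += 1
        else:
            counts[g]["fn" if y == 1 else "tn"] += 1
    result = {}
    for g, c in counts.items():
        pred_pos = c["tp"] + c["fp"]
        pred_neg = c["tn"] + c["fn"]
        ppv = c["tp"] / pred_pos if pred_pos else None
        npv = c["tn"] / pred_neg if pred_neg else None
        result[g] = (ppv, npv)
    return result

# Hypothetical toy data: six individuals in group "a", six in group "b".
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

values = predictive_values_by_group(y_true, y_pred, groups)
ppv_gap = abs(values["a"][0] - values["b"][0])  # 0 would mean exact PRP
print(values, ppv_gap)
```

Predictive rate parity looks only at the first element of each pair; conditional use accuracy equality would require both elements to match across groups.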
Predictive rate parity is fundamentally a constraint on PPV. PPV answers the question: "Given that the model predicted positive, what is the probability the true label is positive?" It is a measure of the trustworthiness of positive predictions and is a familiar quantity outside fairness research, for example in clinical diagnostics where it is reported alongside sensitivity and specificity.
The table below shows how predictive rate parity relates to the standard entries of the confusion matrix:
| Component | Definition | Role in predictive rate parity |
|---|---|---|
| True positives (TP) | Correctly predicted positives | Numerator of PPV |
| False positives (FP) | Incorrectly predicted positives | Denominator term of PPV |
| True negatives (TN) | Correctly predicted negatives | Not directly involved |
| False negatives (FN) | Incorrectly predicted negatives | Not directly involved |
| PPV (precision) | TP / (TP + FP) | Must be equal across groups |
| FDR | FP / (TP + FP) = 1 - PPV | Equal when PPV is equal |
| Selection rate | (TP + FP) / N | Used by demographic parity, not PRP |
| TPR (recall) | TP / (TP + FN) | Used by equal opportunity, not PRP |
| FPR | FP / (FP + TN) | Used by equalized odds, not PRP |
Predictive rate parity is a special case of the sufficiency criterion, which requires that for all values of ŷ and y:
P(Y = y | Ŷ = ŷ, A = a) = P(Y = y | Ŷ = ŷ, A = b)
Sufficiency states that the sensitive attribute provides no additional information about the true outcome once the prediction is known. Predictive rate parity enforces this condition only at Ŷ = 1 (positive predictions). Conditional use accuracy equality enforces it at both Ŷ = 1 and Ŷ = 0.
Predictive rate parity is also closely related to calibration within groups, sometimes called well calibration or test fairness. A risk score S is calibrated within groups if for every score s:
P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b) = s
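Calibration is a stronger, score-level requirement than predictive rate parity. A score calibrated within groups automatically satisfies the score-level form of PRP, because the outcome rate at each score value is the same in every group, but a classifier satisfying PRP need not be calibrated. The reason is that calibration ties the predicted probability to the true probability at every score level, whereas PRP only equates the aggregate fraction of correct positive predictions across groups. Once a calibrated score is thresholded into a binary decision, the resulting PPV is the average outcome rate among individuals above the threshold, so it also depends on how each group's scores are distributed above that threshold.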
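The score-level condition can be checked empirically by binning scores within each group and comparing the mean score in each bin with the observed positive rate. The following is a minimal sketch; the function name `calibration_table`, the choice of four equal-width bins, and the toy data are assumptions made for illustration.

```python
def calibration_table(scores, y_true, groups, n_bins=4):
    """Map (group, bin) to (mean score, observed positive rate, count)."""
    sums = {}
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        score_sum, positives, count = sums.get((g, b), (0.0, 0, 0))
        sums[(g, b)] = (score_sum + s, positives + y, count + 1)
    return {key: (ss / n, pos / n, n) for key, (ss, pos, n) in sums.items()}

# Hypothetical data: group "a" is calibrated by construction, group "b" is not.
scores = ([0.25] * 4 + [0.75] * 4) * 2
y_true = [0, 0, 0, 1, 1, 1, 1, 0] + [0, 0, 1, 1, 1, 1, 0, 0]
groups = ["a"] * 8 + ["b"] * 8

table = calibration_table(scores, y_true, groups)
for (g, b), (mean_score, pos_rate, n) in sorted(table.items()):
    print(g, b, mean_score, pos_rate, n)
```

In group "a" the observed positive rate equals the mean score in both bins, while in group "b" it is 0.5 in both bins regardless of the score.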
The formal study of predictive rate parity emerged from the practical controversy over recidivism risk scores in the United States criminal justice system, then matured into a theoretical research program through impossibility results published in 2016 and 2017.
Long before machine learning existed as a field, statisticians and epidemiologists used PPV to describe the reliability of diagnostic and screening tests. The 1980s and 1990s literature on differential prediction in psychometrics and economics already discussed the problem of test bias as a question about whether predicted outcomes carry the same meaning across subgroups, a notion close to what would later be formalized as predictive rate parity. The terminology of "test fairness" used in the COMPAS debate descends from this older psychometric tradition.
In May 2016, the investigative journalism organization ProPublica published an article titled "Machine Bias" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. The article analyzed COMPAS risk scores assigned to over 10,000 criminal defendants in Broward County, Florida, between 2013 and 2014. ProPublica found that Black defendants were nearly twice as likely as white defendants to be incorrectly labeled as high risk for recidivism (a higher false positive rate), while white defendants who later reoffended were more likely to have been incorrectly labeled low risk (a higher false negative rate).
In July 2016, Northpointe (later renamed equivant) published a rebuttal report by William Dieterich, Christina Mendoza, and Tim Brennan titled "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." The report argued that COMPAS satisfied predictive rate parity: a high risk score corresponded to roughly the same probability of recidivism for Black and white defendants, so the score had the same meaning in both groups. From Northpointe's perspective, equal PPV across racial groups was the appropriate fairness standard.
Alexandra Chouldechova's paper "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," published in Big Data in 2017, formalized the disagreement. Chouldechova proved that when base rates differ between groups, predictive rate parity and classification parity (equal FPR and equal FNR) cannot both hold. Her proof rests on an identity derived from Bayes' theorem:
FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)
where p is the base rate of the positive outcome. If PPV is equalized across groups but the base rate p differs, then FPR and FNR cannot both be equal across groups. The 2017 paper made it clear that the COMPAS controversy reflected a genuine mathematical impossibility, not analytic error by either ProPublica or Northpointe.
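The identity can be recovered in a few lines from the definition of PPV. The derivation below is a sketch in LaTeX notation; writing the confusion-matrix cells as fractions of a group of size N is an assumption made for exposition.

```latex
\begin{align*}
\mathrm{TP} &= p\,(1-\mathrm{FNR})\,N, \qquad \mathrm{FP} = (1-p)\,\mathrm{FPR}\,N \\
\mathrm{PPV} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}
  = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}} \\
\mathrm{PPV}\,(1-p)\,\mathrm{FPR} &= (1-\mathrm{PPV})\,p\,(1-\mathrm{FNR}) \\
\mathrm{FPR} &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})
\end{align*}
```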
In parallel, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan published "Inherent Trade-Offs in the Fair Determination of Risk Scores" at the 8th Innovations in Theoretical Computer Science (ITCS) conference in 2017 (the preprint appeared on arXiv in September 2016). They proved a related but distinct impossibility theorem for risk scores. Their result extended the COMPAS analysis from binary classifiers to continuous risk scores and identified precise conditions under which calibration within groups is incompatible with balanced error rates.
Moritz Hardt, Eric Price, and Nathan Srebro's NeurIPS 2016 paper "Equality of Opportunity in Supervised Learning" proposed equalized odds and equal opportunity as separation-based alternatives to demographic parity. Their post-processing method takes the classifier's score as given and adjusts thresholds per group to satisfy the new criterion. Together with Chouldechova and Kleinberg et al., this paper established the standard taxonomy of group fairness metrics that PRP belongs to.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Weinberger's NeurIPS 2017 paper "On Fairness and Calibration" sharpened the impossibility analysis. They showed that calibration within groups is compatible with at most one of the two error rate conditions (equal FPR or equal FNR), and that any algorithm that achieves both calibration and a single error rate match is no better than randomizing predictions from an existing classifier with some probability.
Sahil Verma and Julia Rubin's 2018 paper "Fairness Definitions Explained," presented at the FairWare workshop, catalogued more than twenty fairness definitions and provided unified notation. The 2021 ACM Computing Surveys paper "A Survey on Bias and Fairness in Machine Learning" by Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan extended this to a comprehensive review. The freely available textbook Fairness and Machine Learning: Limitations and Opportunities by Solon Barocas, Moritz Hardt, and Arvind Narayanan (first published online in 2019, with a print edition from MIT Press in 2023) settled on the independence, separation, sufficiency trichotomy that organizes most modern teaching of group fairness.
Group fairness criteria are commonly grouped into three mutually incompatible families. Predictive rate parity belongs to the sufficiency family.
| Family | Condition | Example metrics | What it requires |
|---|---|---|---|
| Independence | Ŷ ⊥ A | Demographic parity, statistical parity, disparate impact | Predictions are independent of group membership |
| Separation | Ŷ ⊥ A \| Y | Equalized odds, equal opportunity, balance for negatives | Predictions are independent of group membership given the true outcome |
| Sufficiency | Y ⊥ A \| Ŷ | Predictive rate parity, predictive value parity, calibration | True outcome is independent of group membership given the prediction |
The Barocas, Hardt, Narayanan textbook proves that any two of these three conditions can only hold simultaneously in degenerate special cases (such as perfect prediction, or equal base rates). Predictive rate parity belongs to the sufficiency family because it constrains the conditional distribution of Y given Ŷ, restricted to the case Ŷ = 1.
The table below contrasts predictive rate parity with the other most commonly cited group fairness metrics. Each row shows the formal condition, the family it belongs to, and the everyday intuition behind it.
| Fairness metric | Formal condition | Family | Intuition |
|---|---|---|---|
| Predictive rate parity | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | Sufficiency | Positive predictions are equally trustworthy across groups |
| Conditional use accuracy equality | Equal PPV and equal NPV | Sufficiency | Both positive and negative predictions are equally trustworthy |
| Calibration | P(Y=1 \| S=s, A=a) = s for every s and group | Sufficiency | Predicted probabilities match true probabilities at every score level |
| Equalized odds | P(Ŷ=1 \| Y=y, A=a) = P(Ŷ=1 \| Y=y, A=b) for y∈{0,1} | Separation | Equal true positive rates and equal false positive rates across groups |
| Equal opportunity | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | Separation | Equal true positive rates across groups |
| Predictive equality | P(Ŷ=1 \| Y=0, A=a) = P(Ŷ=1 \| Y=0, A=b) | Separation | Equal false positive rates across groups |
| Demographic parity | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | Independence | Equal positive prediction rates across groups |
| Disparate impact (80% rule) | P(Ŷ=1 \| A=a) / P(Ŷ=1 \| A=b) ≥ 0.8 | Independence | Selection rates differ by less than the legal threshold |
| Counterfactual fairness | Prediction is unchanged in a counterfactual world where A is altered | Causal | Group membership has no causal effect on the prediction |
| Individual fairness | Similar individuals receive similar predictions | Individual | Predictions respect a similarity metric on individuals |
The critical observation from this table is that predictive rate parity, equalized odds, and demographic parity each protect a different statistical quantity. They cannot in general all be satisfied at once, so a designer must choose which definition reflects the values relevant to a particular application.
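To make the contrast concrete, the sketch below computes one representative quantity per family for two groups and reports the absolute gaps. The helper names `rates` and `gaps` and the toy data are hypothetical; the code assumes every denominator is non-zero.

```python
def rates(y_true, y_pred):
    """Selection rate, TPR, FPR, and PPV for one group's labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "selection_rate": (tp + fp) / len(pairs),  # independence
        "tpr": tp / (tp + fn),                     # separation
        "fpr": fp / (fp + tn),                     # separation
        "ppv": tp / (tp + fp),                     # sufficiency (PRP)
    }

def gaps(group_a, group_b):
    """Absolute between-group difference for each rate."""
    ra, rb = rates(*group_a), rates(*group_b)
    return {name: abs(ra[name] - rb[name]) for name in ra}

# Hypothetical (y_true, y_pred) pairs; base rates are 3/8 and 6/8.
group_a = ([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
group_b = ([1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0, 1, 1])

result = gaps(group_a, group_b)
print(result)
```

With this data the PPV and TPR gaps are zero, so predictive rate parity and equal opportunity hold, while the FPR and selection-rate gaps are large, so equalized odds and demographic parity fail.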
Chouldechova (2017) proved that when the base rates p differ between two groups, no binary classifier can simultaneously satisfy predictive rate parity and classification parity (equal FPR and equal FNR). The proof uses the identity:
FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)
If PPV is equal across groups (PRP holds) and the base rates differ, then FPR and FNR cannot both be equal. Conversely, if FPR and FNR are equalized (classification parity), PPV must differ whenever base rates differ.
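A short numeric illustration of the identity, using hypothetical values: two groups share the same PPV and FNR but have different base rates, and the identity then forces different false positive rates. The function name `implied_fpr` is an assumption for this sketch.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by Chouldechova's identity."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hypothetical groups with equal PPV and FNR but different base rates.
ppv, fnr = 0.8, 0.2
fpr_a = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
fpr_b = implied_fpr(base_rate=0.2, ppv=ppv, fnr=fnr)

print(f"group a FPR: {fpr_a:.3f}")
print(f"group b FPR: {fpr_b:.3f}")
```

Here the group with the higher base rate ends up with a false positive rate four times as large, even though positive predictions are equally reliable in both groups.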
This result has immediate practical force. In any domain where the prevalence of the outcome varies across demographic groups, which is common in criminal justice, healthcare, credit scoring, and hiring, a designer must choose between calibrating the predictions (predictive rate parity) and balancing the error rates (classification parity). Both cannot be achieved at once except in trivial cases.
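Kleinberg, Mullainathan, and Raghavan (2017) defined three fairness conditions for risk scores: calibration within groups, balance for the positive class (truly positive individuals receive the same average score in every group), and balance for the negative class (truly negative individuals receive the same average score in every group).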
They proved that no risk score can satisfy all three conditions except in two special cases: (a) perfect prediction, where the classifier makes no errors; and (b) equal base rates across groups. In every realistic scenario, at least one condition must be violated. Because calibration within groups implies predictive rate parity, this theorem extends Chouldechova's result from binary classifiers to continuous risk scores.
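The two balance conditions in that paper compare the average score given to truly positive members of each group and the average score given to truly negative members. The sketch below measures both on hypothetical scores that are calibrated within each group by construction; the function name `class_balance` and the data are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)

def class_balance(scores, y_true, groups):
    """Average score among true positives and true negatives, per group."""
    summary = {}
    for g in sorted(set(groups)):
        rows = [(s, y) for s, y, h in zip(scores, y_true, groups) if h == g]
        summary[g] = {
            "positive_class": mean([s for s, y in rows if y == 1]),
            "negative_class": mean([s for s, y in rows if y == 0]),
        }
    return summary

# Hypothetical scores, calibrated within each group by construction:
# a quarter of the 0.25 scores and three quarters of the 0.75 scores are positive.
scores = [0.25] * 4 + [0.75] * 4 + [0.25] * 8 + [0.75] * 4
y_true = [0, 0, 0, 1] + [1, 1, 1, 0] + [0] * 6 + [1] * 2 + [1, 1, 1, 0]
groups = ["a"] * 8 + ["b"] * 12

balance = class_balance(scores, y_true, groups)
print(balance)
```

The base rates are 4/8 in group "a" and 5/12 in group "b". Both groups are calibrated, yet the average scores of positives and of negatives differ between groups, which is the pattern the theorem predicts when base rates differ and prediction is imperfect.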
Pleiss et al. (2017) sharpened the picture for calibrated probability scores. They showed that calibration within groups is compatible with at most one of the two error rate conditions (equal FPR alone or equal FNR alone, not both), and that any algorithm achieving both calibration and one error rate match is no more useful than randomizing predictions from a baseline classifier with some probability. The conclusion is that strict calibration plus error balance can only be enforced by deliberately degrading model performance.
Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian's 2016 paper "On the (Im)possibility of Fairness" framed the impossibility results as the consequence of two competing worldviews: the What You See Is What You Get worldview, in which observed features fairly represent ability, and the We're All Equal worldview, in which any group differences in observed features are the product of structural bias. PRP and demographic parity reflect these two worldviews and are therefore not reconcilable on purely technical grounds.
Because of the impossibility theorems, every choice of fairness metric carries trade-offs against other metrics and against accuracy. Adopting predictive rate parity has specific consequences worth spelling out:
| Trade-off | Effect of enforcing PRP |
|---|---|
| Versus equalized odds | When base rates differ, equal PPV forces unequal FPR or unequal FNR |
| Versus demographic parity | Equal PPV does not constrain selection rates, so groups can be selected at very different rates |
| Versus accuracy | Adjusting thresholds per group to equalize PPV typically reduces overall accuracy |
| Versus negative predictions | PRP says nothing about NPV, so negative predictions can still differ in reliability |
| Versus individual fairness | Equal group level PPV does not protect individuals from being treated unlike similar people in another group |
| Versus utility | If different costs apply to false positives across groups, equal PPV may not equalize harm |
Corbett-Davies and Goel (2018) and Corbett-Davies, Pierson, Feller, Goel, and Huq (2017) argued that predictive rate parity together with the use of a single risk threshold maximizes a particular notion of public safety utility, but does so by accepting unequal error rates whenever base rates differ. The choice of which trade-off to accept is a normative one, not a technical one.
Several open source libraries support the measurement and mitigation of predictive rate parity violations as part of their broader fairness toolkit.
| Tool | Maintainer | Language | Predictive rate parity support |
|---|---|---|---|
| Fairlearn | Microsoft | Python | Reports selection_rate, true_positive_rate, false_positive_rate, and precision (PPV) by group; supports threshold optimization for parity |
| AIF360 (AI Fairness 360) | IBM (now LF AI & Data Foundation) | Python and R | Includes BinaryLabelDatasetMetric and ClassificationMetric with explicit positive_predictive_value_difference and reweighting, prejudice remover, and reject option algorithms |
| What-If Tool | Google | TensorBoard plugin | Visualizes per-group PPV and lets users experiment with per-group thresholds |
| Aequitas | Center for Data Science and Public Policy, University of Chicago | Python | Computes a fairness audit table including ppr (predicted positive rate) and precision disparities |
| Themis-ML | Niels Bantilan | Python | Implements pre-processing and in-processing techniques relevant to PRP |
| FairTest | Columbia and ETH | Python | Statistical investigation framework for unequal predictive value |
Fairlearn's MetricFrame API makes it straightforward to compute PPV per group:
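```python
from sklearn.metrics import precision_score
from fairlearn.metrics import MetricFrame

# y_test, y_pred, and A_test are assumed to come from the surrounding evaluation code.
frame = MetricFrame(
    metrics={"ppv": precision_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=A_test,
)
print(frame.by_group)  # PPV for each group
print("PPV difference:", frame.difference())  # largest between-group gap
```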
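AIF360 reports PPV separately for the unprivileged and privileged groups, from which the disparity follows:
```python
from aif360.metrics import ClassificationMetric

# dataset_true and dataset_pred are assumed to be AIF360 BinaryLabelDataset objects
# holding the ground-truth labels and the classifier's predictions.
cm = ClassificationMetric(
    dataset_true,
    dataset_pred,
    privileged_groups=[{"race": 1}],
    unprivileged_groups=[{"race": 0}],
)
print(cm.positive_predictive_value(privileged=False))  # unprivileged group PPV
print(cm.positive_predictive_value(privileged=True))   # privileged group PPV
```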
Methods to bring a classifier toward predictive rate parity follow the standard taxonomy used across the fairness literature.
| Method type | When applied | Requires retraining | Model-agnostic | How it works |
|---|---|---|---|---|
| Pre-processing | Before training | Yes | Yes | Modifies training data through reweighting, resampling, or representation learning |
| In-processing | During training | Yes | No | Adds a constraint or regularizer on PPV disparity to the loss function |
| Post-processing | After training | No | Yes | Picks a different decision threshold per group so that PPV is equalized |
Pre-processing approaches reshape the training data with the aim that a standard learner then produces predictions with more nearly equal PPV. Faisal Kamiran and Toon Calders' 2012 reweighting method assigns example weights based on the joint distribution of the protected attribute and the label. Optimized pre-processing by Calmon et al. (NeurIPS 2017) jointly transforms features and labels to minimize a divergence subject to fairness constraints.
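The reweighting idea can be sketched in a few lines. In the Kamiran and Calders scheme each (group, label) cell receives the weight P(A=a) P(Y=y) / P(A=a, Y=y), which makes group and label statistically independent in the weighted sample. The function name `reweighing_weights` and the toy data are assumptions for illustration.

```python
from collections import Counter

def reweighing_weights(y_true, groups):
    """Weight per (group, label) cell: P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(y_true)
    group_counts = Counter(groups)
    label_counts = Counter(y_true)
    joint_counts = Counter(zip(groups, y_true))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for (g, y) in joint_counts
    }

# Hypothetical training labels: positives are over-represented in group "a".
groups = ["a"] * 6 + ["b"] * 4
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(y_true, groups)
sample_weight = [weights[(g, y)] for g, y in zip(groups, y_true)]
print(weights)
```

The resulting `sample_weight` list can be passed to any learner that accepts per-example weights. Whether the trained classifier then has equal PPV across groups is not guaranteed and still has to be measured on held-out data.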
In-processing approaches add a fairness penalty to the optimization objective. Zafar et al. (2017) proposed convex margin-based formulations that constrain disparate mistreatment, including the difference in PPV across groups. Agarwal et al. (ICML 2018) developed a reductions approach that turns fair classification into a sequence of cost-sensitive learning problems and supports any classifier as a black-box base learner.
The most widely deployed approach is group-specific threshold selection. After a classifier is trained, the analyst picks a separate decision threshold for each group so that the resulting PPVs are equal. Group-specific thresholds are also the basis of the equalized odds post-processing of Hardt, Price, and Srebro (2016). Pleiss et al. (2017) provide a randomized variant that preserves calibration to the extent possible.
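A minimal sketch of group-specific threshold selection follows. It is not the method of any particular paper: for each group it simply picks the lowest observed score at which PPV reaches a chosen target. The helper names, the target of 0.8, and the validation data are all hypothetical, and on real data an exact match across groups is usually not attainable.

```python
def ppv_at(scores, y_true, threshold):
    """PPV when everyone with score >= threshold is predicted positive."""
    selected = [y for s, y in zip(scores, y_true) if s >= threshold]
    return sum(selected) / len(selected) if selected else None

def lowest_threshold_for_ppv(scores, y_true, target_ppv):
    """Smallest observed score whose PPV reaches the target, or None if none does."""
    for t in sorted(set(scores)):
        ppv = ppv_at(scores, y_true, t)
        if ppv is not None and ppv >= target_ppv:
            return t
    return None

# Hypothetical validation data: group -> (scores, labels).
data = {
    "a": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0]),
    "b": ([0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 1, 0, 0, 0, 0]),
}

shared = {g: ppv_at(s, y, 0.5) for g, (s, y) in data.items()}        # one threshold
thresholds = {g: lowest_threshold_for_ppv(s, y, 0.8) for g, (s, y) in data.items()}
adjusted = {g: ppv_at(*data[g], thresholds[g]) for g in data}        # per-group thresholds
print(shared, thresholds, adjusted)
```

With a shared threshold of 0.5 the two groups have different PPVs; with the per-group thresholds both reach 0.8, at the cost of selecting fewer people in group "b".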
A 2022 study by a team led by Cynthia Dwork at Harvard, Microsoft Research, and Stanford proposed a model-agnostic post-processing transformation function specifically designed to enforce predictive rate parity with minimal impact on overall model performance.
Predictive rate parity is meaningful in any domain where an algorithm produces binary decisions or risk scores that affect people from multiple demographic groups.
Risk assessment instruments such as COMPAS, the Public Safety Assessment (PSA), and the Ohio Risk Assessment System (ORAS) assign risk scores to defendants for pretrial release, sentencing, and parole. Predictive rate parity in this setting requires that a high risk classification carry the same probability of reoffense regardless of the defendant's race or gender. Northpointe's defense of COMPAS rested on this criterion. Most state and county systems that have adopted these tools include some form of predictive validity audit by group.
Clinical prediction models are used to estimate disease risk, allocate resources such as ICU beds and organ transplants, and prioritize patients in screening programs. Predictive rate parity ensures that when a model flags a patient as high risk for a condition, the prediction is equally reliable for patients of different races, genders, and ages. The 2019 Science paper by Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan analyzed a widely used commercial algorithm that used healthcare cost as a proxy for healthcare need, and found that Black patients had to be much sicker than white patients to receive the same risk score. Restoring predictive rate parity in this case required replacing the cost based label with a direct measure of illness severity.
Credit scoring models predict the likelihood that a borrower will repay a loan. Predictive rate parity requires that among approved applicants from different demographic groups, the fraction who actually repay is the same. The U.S. Equal Credit Opportunity Act prohibits discrimination on the basis of race, sex, religion, national origin, marital status, age, and receipt of public assistance. PRP is one technical standard a lender can adopt to demonstrate that its approvals carry consistent meaning across groups.
Automated resume screening, video interview scoring, and admissions algorithms predict candidate success. Predictive rate parity requires that a positive prediction ("this candidate will succeed") is equally accurate across demographic groups. Amazon's experimental resume screening tool, which Reuters reported in 2018 had been abandoned after evidence that it disadvantaged women, illustrates the need to audit such systems. New York City's Local Law 144 of 2021, in force since July 2023, requires bias audits of automated employment decision tools that include explicit calculation of selection rates and impact ratios by group, with predictive rate parity often used as a complementary auditing metric.
Ad targeting and content moderation systems classify users and content for selection and removal decisions. Predictive rate parity is one of several metrics used in audits of these systems, alongside selection rate disparity (related to demographic parity) and error rate disparity (related to equalized odds).
The most fundamental limitation of predictive rate parity is the constraint imposed by Chouldechova's theorem. When the base rates differ across groups, equalizing PPV forces unequal error rates. In domains such as recidivism prediction where base rates do differ, choosing PRP means accepting that some groups will have a higher false positive rate than others.
Predictive rate parity treats observed labels as ground truth. If the labels themselves reflect historical bias, for example arrest records that reflect biased policing rather than true criminal behavior, then equalizing PPV with respect to corrupted labels does not produce fairness with respect to the underlying outcomes. Suresh and Guttag (2021) catalogued the sources of label and measurement bias that can undermine any fairness metric that takes labels as given.
The outcome test (predictive rate parity) can be misleading when applied to aggregate statistics rather than to marginal decisions. Two groups can have equal PPV overall while the classifier still discriminates at the decision threshold. This problem, known as infra-marginality, was discussed in detail by Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel in their 2017 Annals of Applied Statistics paper on the outcome test in policing. Aggregate PPV equality does not imply fairness in individual cases.
PRP is a group-level criterion. It does not guarantee that similar individuals from different groups receive similar predictions. A classifier can satisfy PRP while still treating specific people unfairly, as long as the aggregate PPV statistics are balanced. Individual fairness, introduced by Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel in 2012, addresses this complementary concern by requiring that the prediction function satisfies a Lipschitz condition with respect to a task-specific similarity metric.
Predictive rate parity restricts only the positive prediction case (Ŷ = 1). A classifier could have equal PPV but very different negative predictive values (NPV) across groups, meaning that negative predictions are more reliable for one group than for another. Conditional use accuracy equality addresses this by requiring equal NPV in addition to equal PPV.
PPV depends on the decision threshold. A model can satisfy PRP at one threshold and violate it at another. Audits should report PPV at multiple operating points or use threshold-free measures such as the area under the precision-recall curve (PR AUC) alongside the parity metric.
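The threshold sensitivity is easy to demonstrate. The sketch below reports PPV per group at several thresholds; the function name `ppv_by_threshold`, the thresholds, and the toy data are hypothetical.

```python
def ppv_by_threshold(scores, y_true, groups, thresholds):
    """Map each threshold to {group: PPV}; PPV is None when nobody is selected."""
    report = {}
    for t in thresholds:
        row = {}
        for g in sorted(set(groups)):
            picked = [y for s, y, h in zip(scores, y_true, groups) if h == g and s >= t]
            row[g] = sum(picked) / len(picked) if picked else None
        report[t] = row
    return report

# Hypothetical scores and labels for two groups of four individuals each.
scores = [0.9, 0.7, 0.5, 0.3, 0.9, 0.7, 0.5, 0.3]
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

report = ppv_by_threshold(scores, y_true, groups, thresholds=[0.4, 0.6, 0.8])
for t, row in report.items():
    print(t, row)
```

In this toy data PPV is equal across groups at thresholds 0.4 and 0.8 but differs at 0.6, so an audit run at a single operating point could reach either conclusion.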
Predictive rate parity is purely statistical. It does not consider whether the disparity in predictions arises from a causal effect of the protected attribute or from a legitimate predictor that happens to correlate with it. Counterfactual fairness, introduced by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva at NeurIPS 2017, takes a causal view that PRP cannot represent.
When PRP becomes a regulatory or audit target, organizations have an incentive to optimize for the metric in ways that satisfy the letter but not the spirit. Selecting a non representative validation set, choosing favorable thresholds, or excluding marginal cases can all produce equal aggregate PPV without producing fair outcomes.
Consider a simplified loan approval setting with two equally sized groups A and B. Suppose the classifier produces the following confusion matrices.
| Quantity | Group A | Group B |
|---|---|---|
| True positives (TP) | 80 | 60 |
| False positives (FP) | 20 | 15 |
| Selected (TP + FP) | 100 | 75 |
| PPV (precision) | 0.80 | 0.80 |
| False negatives (FN) | 20 | 40 |
| True negatives (TN) | 80 | 85 |
| Actual positives (TP + FN) | 100 | 100 |
| TPR (recall) | 0.80 | 0.60 |
| Base rate | 0.50 | 0.50 |
In this example PPV is 0.80 in both groups, so predictive rate parity holds. However, the true positive rate (recall) is 0.80 in group A and 0.60 in group B, so equal opportunity and equalized odds are violated. The selection rate (100/200 vs 75/200) is also different, so demographic parity is violated. This illustrates how PRP can hold while other fairness metrics fail.
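The figures in the table can be reproduced directly from the four confusion-matrix counts. The helper name `summarize` is an assumption for this sketch; it also reports the false positive rate, which the table does not list.

```python
def summarize(tp, fp, fn, tn):
    """Rates derived from one group's confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "ppv": tp / (tp + fp),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "selection_rate": (tp + fp) / n,
        "base_rate": (tp + fn) / n,
    }

# Counts taken from the worked example above.
group_a = summarize(tp=80, fp=20, fn=20, tn=80)
group_b = summarize(tp=60, fp=15, fn=40, tn=85)

for name in group_a:
    print(f"{name:15s} A={group_a[name]:.3f}  B={group_b[name]:.3f}")
```

The false positive rates come out as 0.20 for group A and 0.15 for group B, which confirms that equalized odds fails on both of its components.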
Imagine a teacher who gives gold stars to the students she thinks did well on a test. Predictive rate parity means that when she gives a gold star, she is right about the same percentage of the time for boys and for girls. If she gives gold stars to 10 boys and 8 of them really did well (80% correct), then when she gives gold stars to 10 girls, about 8 of them should really have done well too (also 80% correct).
If the gold stars are right 80% of the time for boys but only 60% of the time for girls, that is a problem. The gold star means something different depending on whether you are a boy or a girl. Predictive rate parity says the gold star should be worth the same for everyone.
The tricky part is that making the gold stars equally accurate for boys and girls might mean the teacher makes more mistakes in other ways. For example, she might miss more girls who really did well (and give them no star when they deserve one). That is the trade-off that mathematicians have proven is unavoidable when the two groups have different average scores.