Equalized odds is a group fairness criterion in machine learning that requires a classifier's true positive rate (TPR) and false positive rate (FPR) to be equal across all groups defined by a protected attribute. Introduced by Moritz Hardt, Eric Price, and Nathan Srebro in their 2016 paper "Equality of Opportunity in Supervised Learning," equalized odds has become one of the most widely referenced definitions in algorithmic fairness. The criterion formalizes the idea that a predictor's errors should not be correlated with membership in a protected group, conditional on the true outcome.
Imagine a teacher giving a test and then checking the answers. Equalized odds means the teacher is equally good at grading papers from every student, no matter what group the student belongs to. If the teacher accidentally marks a right answer as wrong, that mistake should happen just as often for boys as for girls. And if the teacher accidentally marks a wrong answer as right, that mistake should also happen at the same rate for everyone. The rule does not say every student must get the same grade; it says the teacher's mistakes should be spread out evenly so no group of students gets treated unfairly by the grading process.
The concept of equalized odds emerged from growing concerns about discrimination in automated decision-making systems. By the mid-2010s, machine learning models were being deployed at scale in areas such as criminal justice, lending, hiring, and healthcare. Researchers and journalists documented cases where these systems produced outcomes that were systematically worse for certain demographic groups.
One widely discussed example was the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) recidivism prediction tool. A 2016 ProPublica investigation found that Black defendants who did not go on to reoffend were labeled high risk at nearly twice the rate of white defendants who did not reoffend (a false positive rate of 45% versus 23%). This disparity motivated the search for formal criteria that could quantify and constrain such imbalances.
Hardt, Price, and Srebro presented equalized odds at the 30th Conference on Neural Information Processing Systems (NeurIPS 2016). Their paper proposed both the formal definition and a practical post-processing algorithm for enforcing it. The paper also introduced the related (and weaker) notion of equal opportunity, which requires parity only for the true positive rate. The work included a case study on FICO credit score data, demonstrating that equalized odds could be approximately satisfied with a modest reduction in overall accuracy.
Let Y denote the true binary label, A denote the protected attribute (for example, race or gender), and R denote the classifier's predicted label. A predictor R satisfies equalized odds with respect to A and Y if:
R is conditionally independent of A given Y.
In probabilistic notation:
P(R = 1 | Y = y, A = a) = P(R = 1 | Y = y, A = a') for all y in {0, 1} and all values a, a' of A
Breaking this into its two component conditions:

Equal true positive rate (TPR): P(R = 1 | Y = 1, A = a) = P(R = 1 | Y = 1, A = a') for all a, a'

Equal false positive rate (FPR): P(R = 1 | Y = 0, A = a) = P(R = 1 | Y = 0, A = a') for all a, a'
Equivalently, equalized odds can be stated using the conditional independence notation:
R ⊥ A | Y
This means the predicted label and the protected attribute are statistically independent once the true label is known.
The equal TPR condition ensures that qualified individuals (Y = 1) from all groups have the same probability of receiving a positive prediction. In a loan approval context, this means creditworthy applicants should be approved at the same rate regardless of race or gender.
The equal FPR condition ensures that unqualified individuals (Y = 0) from all groups face the same probability of receiving a false positive. In the same loan context, this means non-creditworthy applicants should not be approved at systematically different rates based on their protected attribute.
Together, these two conditions mean that the classifier's errors are distributed equally across groups at every level of the true outcome.
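As a concrete illustration, both conditions can be checked empirically by computing per-group rates from a sample of predictions. The following sketch uses plain NumPy; the function name and interface are illustrative rather than taken from any library, and it assumes each group contains both positive and negative examples.

```python
import numpy as np

def equalized_odds_gaps(y_true, y_pred, group):
    """Per-group TPR/FPR and the largest pairwise gaps across groups.

    Equalized odds holds empirically when both gaps are (near) zero.
    Assumes every group has at least one positive and one negative example.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr, fpr = {}, {}
    for g in np.unique(group):
        in_g = group == g
        tpr[g] = y_pred[in_g & (y_true == 1)].mean()  # P(R=1 | Y=1, A=g)
        fpr[g] = y_pred[in_g & (y_true == 0)].mean()  # P(R=1 | Y=0, A=g)
    tpr_gap = max(tpr.values()) - min(tpr.values())
    fpr_gap = max(fpr.values()) - min(fpr.values())
    return tpr, fpr, tpr_gap, fpr_gap
```

The TPR gap alone measures the equal opportunity criterion discussed below; equalized odds requires both gaps to vanish.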
Equalized odds sits within a larger family of statistical fairness definitions. Understanding the relationships and tensions among these criteria is necessary for selecting the right metric for a given application.
Equal opportunity is a relaxation of equalized odds proposed in the same paper by Hardt, Price, and Srebro. It requires only that the true positive rates be equal across groups:
P(R = 1 | Y = 1, A = a) = P(R = 1 | Y = 1, A = a') for all a, a'
Equal opportunity drops the false positive rate constraint, making it a strictly weaker condition. In practice, equal opportunity is often easier to achieve and may be preferred in settings where the cost of a false negative is much higher than the cost of a false positive (for example, in medical screening, where failing to detect a disease is the primary concern).
Demographic parity (also called statistical parity) requires that the overall positive prediction rate be the same across groups:
P(R = 1 | A = a) = P(R = 1 | A = a') for all a, a'
Unlike equalized odds, demographic parity does not condition on the true label Y. This means it can be satisfied by a classifier that is equally inaccurate across groups. Demographic parity is most appropriate when there is reason to believe the labels themselves are biased. It is incompatible with equalized odds whenever the base rates P(Y = 1 | A = a) differ between groups, except in degenerate cases such as a predictor that is independent of Y.
Predictive parity (also called calibration across groups or conditional use accuracy equality) requires equal positive predictive values across groups:
P(Y = 1 | R = 1, A = a) = P(Y = 1 | R = 1, A = a') for all a, a'
This was the fairness notion cited by Northpointe (the maker of COMPAS) to argue that their tool was fair, since it assigned similar recidivism probabilities to defendants with similar risk scores regardless of race.
| Fairness criterion | Conditions on | Formal requirement | Key property |
|---|---|---|---|
| Equalized odds | TPR and FPR | R ⊥ A \| Y | Equal error rates across groups |
| Equal opportunity | TPR only | P(R=1 \| Y=1, A=a) equal for all a | Relaxation of equalized odds |
| Demographic parity | Positive prediction rate | R ⊥ A | Does not condition on true label |
| Predictive parity | Positive predictive value | Y ⊥ A \| R=1 | Equal precision across groups |
| Calibration | Score-level | P(Y=1 \| S=s, A=a) = s for all a, s | Predicted probabilities match outcomes |
Equalized odds is a group fairness criterion, meaning it concerns aggregate statistics across demographic groups rather than outcomes for specific individuals. Individual fairness, introduced by Dwork et al. (2012), requires that similar individuals receive similar predictions. Counterfactual fairness, proposed by Kusner et al. (2017), asks whether a prediction would change if a person's protected attribute were different while everything else remained the same.
These paradigms operate at different levels of analysis. Group fairness metrics like equalized odds can be satisfied while individual cases remain unfair, and vice versa. Recent work has shown that under certain causal assumptions, satisfying equalized odds can imply a form of counterfactual fairness, but the relationship is not guaranteed in general.
A set of results from 2016 and 2017 established that the most common fairness criteria cannot all be satisfied simultaneously, except in trivial or degenerate cases. These impossibility theorems have shaped the understanding that fairness in machine learning involves unavoidable tradeoffs.
In "Inherent Trade-Offs in the Fair Determination of Risk Scores," Kleinberg, Mullainathan, and Raghavan proved that three natural fairness conditions (calibration, balance for the positive class, and balance for the negative class) cannot all hold simultaneously unless the groups have identical base rates or the predictor is perfect. Since balance for the positive and negative classes together constitute equalized odds, this result directly shows that equalized odds and calibration are incompatible in typical settings.
In "Fair Prediction with Disparate Impact," Alexandra Chouldechova demonstrated that when the prevalence (base rate) of the positive outcome differs between groups, it is impossible to simultaneously equalize false positive rates, false negative rates, and positive predictive values. This means equalized odds and predictive parity cannot coexist unless both groups have the same base rate.
In "On Fairness and Calibration," Pleiss, Raghavan, Wu, Kleinberg, and Weinberger further explored the tension between calibration and equalized odds. They showed that a calibrated classifier cannot also satisfy equalized odds (except in degenerate cases) and proposed a relaxation called "calibrated equalized odds" that partially reconciles these objectives by optimizing a cost-weighted combination of the two.
| Incompatible pair | Condition for compatibility | Source |
|---|---|---|
| Equalized odds + Calibration | Groups have identical base rates or predictor is perfect | Kleinberg, Mullainathan, Raghavan (2017) |
| Equalized odds + Predictive parity | Groups have identical base rates | Chouldechova (2017) |
| Equalized odds + Demographic parity | Groups have identical base rates | Follows from the definitions |
| Calibration + Equalized odds | Degenerate cases only | Pleiss et al. (2017) |
These results do not mean fairness is unachievable. They mean practitioners must choose which fairness criterion best fits their context and accept the tradeoffs that come with that choice.
Several approaches have been developed to train or modify classifiers so they satisfy (or approximately satisfy) equalized odds. These methods fall into three categories based on where in the machine learning pipeline they intervene.
The original approach proposed by Hardt, Price, and Srebro is a post-processing method. Given a trained classifier and knowledge of the protected attribute, the algorithm finds group-specific thresholds (or randomized thresholds) that equalize TPR and FPR across groups. The optimization can be formulated as a linear program with four variables and two equality constraints. The method works with any base classifier and does not require retraining, making it practical for deployment on existing models.
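As a concrete example, Fairlearn's ThresholdOptimizer implements this post-processing approach. The following is a minimal sketch on synthetic data; the toy data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
a = rng.integers(0, 2, size=2000)  # binary protected attribute
y = (X[:, 0] + 0.8 * a + rng.normal(size=2000) > 0).astype(int)

# Any base classifier works; post-processing does not retrain it.
base = LogisticRegression().fit(X, y)

# Search group-specific (randomized) thresholds that equalize TPR and FPR.
postprocessor = ThresholdOptimizer(
    estimator=base,
    constraints="equalized_odds",
    prefit=True,
    predict_method="predict_proba",
)
postprocessor.fit(X, y, sensitive_features=a)

# The protected attribute is required again at prediction time.
y_fair = postprocessor.predict(X, sensitive_features=a, random_state=0)
```

Note the last line: needing the protected attribute at inference is an inherent property of this method, revisited in the comparison table below.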
Pleiss et al. (2017) extended this with a calibrated equalized odds post-processor that balances equalized odds against calibration by optimizing a cost function over the classifier's score output.
Both methods are implemented in IBM's AI Fairness 360 (AIF360) toolkit.
In-processing methods modify the training procedure itself to incorporate fairness constraints.
Agarwal, Beygelzimer, Dudik, Langford, and Wallach (2018) introduced a reductions approach that converts fair classification into a sequence of cost-sensitive classification problems. Their exponentiated gradient algorithm iteratively solves a Lagrangian formulation to find the classifier that minimizes error while satisfying equalized odds constraints. This approach is implemented in Microsoft's Fairlearn library.
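A minimal sketch of the reductions approach via Fairlearn follows; the toy data and the particular bound are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
a = rng.integers(0, 2, size=2000)  # binary protected attribute
y = (X[:, 0] + 0.8 * a + rng.normal(size=2000) > 0).astype(int)

# The wrapper repeatedly fits cost-sensitive copies of the base learner
# and mixes them into a randomized classifier that approximately
# satisfies the equalized odds constraint.
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(),
    constraints=EqualizedOdds(difference_bound=0.02),  # allowed TPR/FPR gap
)
mitigator.fit(X, y, sensitive_features=a)
y_fair = mitigator.predict(X)
```

Unlike post-processing, the protected attribute is needed only during training; predictions at deployment time use the features alone.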
Adversarial debiasing methods train a primary classifier alongside an adversary that attempts to predict the protected attribute from the classifier's predictions. By penalizing the primary classifier when the adversary succeeds, the training process pushes toward predictions that are independent of the protected attribute conditional on the true label.
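The following is a minimal PyTorch sketch of this idea, not any library's implementation: the toy data, network sizes, and the tradeoff weight lam are illustrative. Giving the adversary the true label alongside the prediction is what targets equalized odds (conditional independence given Y) rather than demographic parity, as in Zhang et al. (2018).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: features X, binary labels y, binary protected attribute a.
n, d = 1000, 5
X = torch.randn(n, d)
a = (torch.rand(n) < 0.5).float()
y = ((X[:, 0] + 0.5 * a + 0.3 * torch.randn(n)) > 0).float()

clf = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
# The adversary sees the classifier's logit *and* the true label, so
# fooling it pushes toward predictions independent of A given Y.
adv = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-2)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
lam = 1.0  # fairness-accuracy tradeoff weight (illustrative)

for step in range(200):
    logit = clf(X).squeeze(1)
    adv_in = torch.stack([logit, y], dim=1)

    # Adversary step: learn to predict A from (prediction, true label).
    adv_loss = bce(adv(adv_in.detach()).squeeze(1), a)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # Classifier step: fit Y while making the adversary's task harder.
    clf_loss = bce(logit, y) - lam * bce(adv(adv_in).squeeze(1), a)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()
```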
Pre-processing methods transform the training data before a classifier is trained. Techniques include:

- Re-weighting training examples so that group membership and the label are statistically independent in the weighted training distribution
- Re-sampling (over- or under-sampling) to balance group-and-label combinations
- Learning transformed feature representations from which the protected attribute cannot be recovered
Pre-processing approaches are model-agnostic but may not guarantee that the final classifier satisfies equalized odds exactly.
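As one concrete example of the re-weighting idea, the classic reweighing scheme of Kamiran and Calders assigns each (group, label) cell the weight that makes the protected attribute and the label independent in the weighted data. The sketch below is illustrative (the function name is not from any library), and, consistent with the caveat above, it does not by itself guarantee equalized odds for the downstream classifier.

```python
import numpy as np

def reweighing_weights(a, y):
    """Kamiran-Calders reweighing: weight each (group, label) cell by
    its expected frequency under independence divided by its observed
    frequency, making A and Y independent in the weighted data."""
    a, y = np.asarray(a), np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for av in np.unique(a):
        for yv in np.unique(y):
            cell = (a == av) & (y == yv)
            if cell.any():
                w[cell] = (a == av).mean() * (y == yv).mean() / cell.mean()
    return w

# Most scikit-learn estimators accept these via sample_weight, e.g.:
# LogisticRegression().fit(X, y, sample_weight=reweighing_weights(a, y))
```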
| Approach | Stage | Requires retraining | Protected attribute needed | Theoretical guarantees |
|---|---|---|---|---|
| Post-processing (Hardt et al.) | After training | No | At inference time | Exact equalized odds via LP |
| Calibrated equalized odds (Pleiss et al.) | After training | No | At inference time | Approximate equalized odds with calibration |
| Reductions (Agarwal et al.) | During training | Yes | At training time only | Finite-sample guarantees |
| Adversarial debiasing | During training | Yes | At training time only | No exact guarantees |
| Re-sampling / Re-weighting | Before training | Yes | At data-preparation stage only | No exact guarantees |
The debate around the COMPAS recidivism prediction instrument is perhaps the most well-known example of equalized odds in practice and illustrates the tensions between competing fairness criteria.
In May 2016, ProPublica published an investigation titled "Machine Bias" analyzing COMPAS, a commercial tool used by courts across the United States to assess the likelihood that a criminal defendant would reoffend. ProPublica obtained risk scores for over 7,000 defendants in Broward County, Florida, and compared them against actual two-year recidivism outcomes.
Their key findings included:

- Black defendants who did not reoffend within two years were misclassified as high risk at nearly twice the rate of white defendants (a false positive rate of roughly 45% versus 23%)
- White defendants who did reoffend were misclassified as low risk markedly more often than Black defendants (roughly 48% versus 28%)
These disparities represent a clear violation of equalized odds.
Northpointe (now Equivant), the company that developed COMPAS, responded by arguing that their tool satisfied a different fairness criterion: predictive parity. Defendants with the same risk score had similar recidivism rates regardless of race. As the impossibility theorems later formalized, both claims were correct simultaneously; COMPAS could satisfy predictive parity while violating equalized odds, precisely because the base rates of recidivism differed between the two groups.
This controversy brought academic fairness definitions into public discourse and highlighted the fact that "fairness" in machine learning is not a single concept but a family of related, sometimes mutually exclusive, criteria.
Equalized odds has been applied or proposed as a fairness measure in numerous domains.
Beyond COMPAS, risk assessment tools are used throughout the criminal justice system for bail decisions, sentencing recommendations, and parole evaluations. Equalized odds has been proposed as a standard for evaluating whether these tools treat defendants from different racial groups equitably in terms of error rates. Researchers have developed post-processing methods specifically designed for risk assessment instruments to satisfy equalized odds or its counterfactual variant.
In credit scoring, models predict whether a borrower will default on a loan. Equalized odds ensures that creditworthy borrowers from all demographic groups have the same chance of loan approval (equal TPR) and that non-creditworthy borrowers from all groups face the same chance of being incorrectly approved (equal FPR). The original Hardt et al. paper included experiments on FICO credit score data showing that equalized odds post-processing could produce fairer outcomes. Analyses of credit scoring datasets have found that enforcing equalized odds can reduce model utility (measured by Brier score improvements) by up to 20% relative to unconstrained baselines.
Medical diagnostic and prognostic models can exhibit disparities in performance across racial, ethnic, and gender groups. Equalized odds provides a framework for ensuring that a diagnostic model has the same sensitivity (TPR) and specificity (1 minus FPR) across patient groups. For example, a disease screening model satisfying equalized odds would detect the disease at the same rate for all demographic groups and would also produce false alarms at the same rate for all groups.
Automated resume screening and candidate ranking systems are increasingly used in hiring. Equalized odds applied to hiring would require that qualified candidates from all groups are selected at the same rate and that unqualified candidates from all groups are incorrectly selected at the same rate. This is relevant in light of cases such as Amazon's experimental AI recruiting tool (developed around 2014, discontinued by 2018), which was found to systematically downgrade resumes containing terms associated with women.
| Domain | Positive label (Y=1) | What equal TPR means | What equal FPR means |
|---|---|---|---|
| Criminal justice | Reoffends | Equal detection of actual recidivists across groups | Equal false alarm rate for non-recidivists across groups |
| Credit scoring | Defaults on loan | Equal detection of actual defaulters across groups | Equal false approval rate for non-defaulters across groups |
| Healthcare | Has disease | Equal sensitivity across patient groups | Equal false alarm rate across patient groups |
| Hiring | Qualified candidate | Equal selection rate for qualified candidates across groups | Equal false selection rate for unqualified candidates across groups |
Enforcing equalized odds typically comes at a cost to overall predictive accuracy. This tradeoff is a central practical concern when deploying fair classifiers.
Zhong and Xia (2024) studied the intrinsic fairness-accuracy tradeoffs under equalized odds and derived upper bounds on accuracy as a function of a fairness budget. Their theoretical bounds were validated empirically on the COMPAS, Adult Income, and Law School datasets. The results show that when there is significant statistical disparity in the outcome distribution across groups, the accuracy cost of equalized odds can be substantial.
In practice, the magnitude of the accuracy reduction depends on several factors:

- The difference in base rates P(Y = 1 | A = a) between groups: the larger the disparity, the larger the accuracy cost
- The relative sizes of the groups, since the larger group's predictions are typically adjusted more to match the smaller group's error rates
- How accurate the unconstrained classifier is: a perfect predictor satisfies equalized odds at no cost, consistent with the compatibility conditions in the impossibility results
- Whether exact parity is required or a small bounded gap is tolerated
These tradeoffs do not invalidate equalized odds as a fairness criterion. They do, however, underscore the importance of carefully evaluating whether the accuracy cost is acceptable in a given application context.
Several extensions of equalized odds have been proposed to address limitations of the original binary-attribute, binary-label formulation.
Woodworth, Gunasekar, Ohannessian, and Srebro (2017) extended equalized odds to settings with more than two outcome classes in "Learning Non-Discriminatory Predictors." Their work also studied the statistical and computational complexity of learning classifiers that satisfy equalized odds, showing that post-processing can be suboptimal and that the general learning problem under equalized odds constraints is computationally hard.
In practice, exact equalized odds may be unachievable or too costly. Approximate versions relax the equality constraints to allow small bounded differences in TPR and FPR across groups. For example, guidance developed around the EU AI Act, which entered into force in August 2024, has discussed an "equalized odds gap" metric with a tolerance threshold (such as 10%) as one measure for assessing bias in high-risk AI systems.
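Using the equalized_odds_gaps helper sketched earlier (an illustrative function, not a library API), an approximate check against a tolerance reduces to a simple comparison:

```python
# y_true, y_pred, group as in the earlier metric sketch.
tpr, fpr, tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, group)
eo_gap = max(tpr_gap, fpr_gap)      # worst-case violation over both conditions
within_tolerance = eo_gap <= 0.10   # e.g., the 10% threshold discussed above
```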
A practical challenge with post-processing methods is that they require knowledge of the protected attribute at prediction time. Awasthi et al. (2020) studied equalized odds post-processing under imperfect group information, where group membership is predicted rather than observed. Their results show that post-processing with noisy group labels can still reduce unfairness, though the guarantees weaken as the group prediction accuracy decreases.
Combining equalized odds with causal inference ideas, counterfactual equalized odds requires that the classifier's error rates would remain equal across groups even in a hypothetical world where a person's protected attribute had been different. This variant addresses concerns that standard equalized odds may not capture the causal mechanisms of discrimination.
Several open-source libraries provide implementations of equalized odds metrics and enforcement algorithms.
| Tool | Developer | Language | Key equalized odds features |
|---|---|---|---|
| AI Fairness 360 (AIF360) | IBM | Python, R | Post-processing (Hardt et al.), Calibrated Equalized Odds (Pleiss et al.), metrics |
| Fairlearn | Microsoft | Python | Exponentiated gradient (Agarwal et al.), threshold optimizer, metrics |
| What-If Tool | Google | Python, web | Interactive fairness exploration and equalized odds visualization |
| Aequitas | Center for Data Science and Public Policy | Python | Bias audit toolkit with equalized odds metrics |
Equalized odds, despite its wide adoption, has several recognized limitations.
Reliance on group labels. Equalized odds requires defining discrete protected groups. In reality, identity is intersectional and continuous. Enforcing equalized odds on broad categories (for example, "race") may mask disparities within subgroups.
Dependence on label quality. Equalized odds conditions on the true label Y, but in many real-world settings the labels themselves may reflect historical bias. For instance, if arrest data is used as a proxy for criminal behavior, then Y itself encodes biased policing practices. Equalizing error rates with respect to a biased label may perpetuate rather than correct unfairness.
Incompatibility with other criteria. As the impossibility theorems demonstrate, equalized odds cannot be satisfied simultaneously with calibration or predictive parity when base rates differ. Choosing equalized odds means accepting violations of these other properties.
Accuracy costs. Enforcing equalized odds can reduce overall accuracy, particularly when base rates differ significantly between groups. The accuracy cost falls disproportionately on the larger group, since their predictions are adjusted more to match the smaller group's error rates.
Individual fairness concerns. Equalized odds is a group-level criterion. Two individuals with identical features but different group memberships may receive different predictions under an equalized odds classifier, which can conflict with intuitions about individual fairness.
Equalized odds and related fairness metrics are increasingly referenced in regulatory frameworks for AI.
The EU AI Act (Regulation (EU) 2024/1689), the first comprehensive legal framework for artificial intelligence, entered into force on August 1, 2024. Its prohibitions on certain AI practices became effective in February 2025, and full enforcement for high-risk AI systems begins in August 2026. The Act requires bias testing and monitoring for high-risk AI systems (including those used in credit scoring, hiring, and law enforcement), and accompanying regulatory guidance has referenced the equalized odds gap as one of the metrics for assessing compliance.
In the United States, while no comprehensive federal AI fairness law exists as of early 2026, several sector-specific regulations and guidelines reference error rate parity concepts that align with equalized odds. The Consumer Financial Protection Bureau (CFPB) and the Equal Employment Opportunity Commission (EEOC) have both issued guidance on algorithmic fairness in their respective domains.