# Fairness Metric

> Source: https://aiwiki.ai/wiki/fairness_metric
> Updated: 2026-06-21
> Categories: AI Ethics, Machine Learning, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **fairness metric** is a quantitative, mathematical measure used to evaluate whether a [machine learning](/wiki/machine_learning) model's predictions or decisions treat different demographic groups equitably. Fairness metrics provide formal criteria that let researchers audit [algorithms](/wiki/algorithm) for [bias](/wiki/bias), compare how different models behave across protected groups, and guide the design of bias mitigation strategies. The most widely used metrics include [demographic parity](/wiki/demographic_parity), [equalized odds](/wiki/equalized_odds), equal opportunity, predictive parity, [calibration](/wiki/calibration), and the [disparate impact](/wiki/disparate_impact) ratio (the four-fifths or 80% rule). A foundational result in the field is that several of these criteria are mutually incompatible: when base rates differ between groups, no imperfect classifier can satisfy them all at once, so selecting a metric is an inherently normative choice that depends on the application domain, the type of harm being measured, and the values of the stakeholders involved.[7]

Fairness metrics have become a central topic in responsible AI research, driven in part by high-profile controversies such as the 2016 ProPublica investigation of the COMPAS recidivism prediction tool, in which Black defendants were flagged as high risk at a false positive rate of 44.9% compared with 23.5% for white defendants, and by growing regulatory attention from bodies like the U.S. Equal Employment Opportunity Commission (EEOC) and the European Union.[8][15]

## Explain like I'm 5 (ELI5)

Imagine a teacher is handing out gold stars to students. A fairness metric is like a rule that checks whether the teacher is giving stars fairly. For example, one rule might say "the same percentage of boys and girls should get stars." Another rule might say "if a boy and a girl both did great work, they should both get a star." These rules sometimes disagree with each other, and that is one of the tricky parts about fairness. In machine learning, computers make decisions about people (like who gets a loan or who gets called for a job interview), and fairness metrics are the rules we use to check whether those decisions are being made fairly for everyone.

## What is a fairness metric used for?

Fairness metrics serve three main purposes in the machine learning lifecycle. First, they let teams audit a deployed or candidate model by measuring how its outcomes differ across protected groups defined by attributes such as race, gender, or age. Second, they enable apples-to-apples comparison between models, or between a model and a human baseline. Third, they act as objectives or constraints during bias mitigation, telling an optimization procedure what "fair" should mean. Because the metrics formalize different and sometimes conflicting intuitions about fairness, the choice of metric effectively encodes a value judgment about which kind of error is least acceptable in a given context.

## Historical background

The intellectual roots of fairness metrics extend well beyond computer science. Quantitative fairness testing first emerged in the 1960s and 1970s, following the passage of the Civil Rights Act of 1964 in the United States. The landmark Supreme Court case *Griggs v. Duke Power Co.* (1971) established the legal concept of "disparate impact," ruling that employment practices with discriminatory effects violate Title VII of the Civil Rights Act even if the employer had no discriminatory intent.[14] The court found that aptitude tests used by Duke Power Company resulted in a pass rate of 58% for white applicants but only 6% for Black applicants, and that the tests bore no demonstrable relationship to job performance.[14]

Following *Griggs*, the EEOC and three other federal agencies (the Department of Labor, the Department of Justice, and the Civil Service Commission) codified the "four-fifths rule" (also known as the 80% rule) in the 1978 Uniform Guidelines on Employee Selection Procedures. The guidelines state that "a selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact."[15] Although the four-fifths rule is not a strict legal standard and has known statistical limitations, it has directly influenced the development of [disparate impact](/wiki/disparate_impact) metrics in machine learning.

Modern fairness metrics research in machine learning accelerated after 2012, when Dwork et al. introduced the concept of individual fairness in their paper "Fairness Through Awareness."[1] The field gained broader public attention in May 2016, when ProPublica published its "Machine Bias" investigation. ProPublica analyzed COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a risk assessment tool developed by Northpointe (now Equivant) and used in U.S. courts to predict recidivism. Studying defendants in Broward County, Florida, ProPublica found that Black defendants were almost twice as likely as white defendants to be incorrectly flagged as high risk (a false positive rate of 44.9% versus 23.5%), while white defendants were more likely to be incorrectly labeled low risk despite going on to reoffend (a higher false negative rate).[8][15] Northpointe responded that its tool satisfied predictive parity, meaning that among defendants assigned the same risk score, Black and white defendants reoffended at similar rates.

Both sides were correct by their own chosen metric. This disagreement highlighted a fundamental tension that was soon formalized by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) in what became known as the impossibility theorems of fairness.[3][4]

## Taxonomy of fairness metrics

Fairness metrics can be organized into several broad families based on what they measure and at what level of granularity they operate.

| Family | Level | Core idea | Key examples |
|---|---|---|---|
| Group fairness (independence) | Group | Predictions should be statistically independent of protected attributes | [Demographic parity](/wiki/demographic_parity), conditional statistical parity |
| Group fairness (separation) | Group | Prediction errors should be equal across groups, given the true outcome | [Equalized odds](/wiki/equalized_odds), equal opportunity, false positive rate balance |
| Group fairness (sufficiency) | Group | True outcomes should be independent of group membership, given the prediction | Predictive parity, [calibration](/wiki/calibration), conditional use accuracy equality |
| Individual fairness | Individual | Similar individuals should receive similar predictions | Fairness through awareness (Lipschitz condition) |
| Causal fairness | Individual/Structural | Outcomes should not change in counterfactual scenarios where group membership differs | [Counterfactual fairness](/wiki/counterfactual_fairness) |
| Subgroup fairness | Subgroup | Fairness should hold for intersections of protected attributes, not just single groups | Subgroup fairness (Kearns et al., 2018) |

The three group-fairness families above (independence, separation, and sufficiency) are often called the "big three" of statistical fairness, because nearly every group metric can be derived from one of them, and because the impossibility theorems are most cleanly stated in terms of these three categories.

## Group fairness metrics

Group fairness metrics compare the statistical behavior of a model's predictions across demographic groups defined by one or more protected attributes (such as race, gender, or age). These metrics can be further divided based on which statistical property they require to be equal across groups.

### Demographic parity (statistical parity)

Demographic parity requires that the probability of receiving a positive prediction is the same for all groups. Formally, a classifier h satisfies demographic parity if:

**P(h(X) = 1 | A = a) = P(h(X) = 1 | A = b)**

for all values a and b of the protected attribute A.

In other words, the selection rate (the fraction of individuals who receive a positive outcome) must be equal across groups, regardless of whether those individuals actually qualified for the positive outcome. Demographic parity corresponds to the statistical independence criterion: writing R = h(X) for the prediction, R is statistically independent of the sensitive attribute A. The disparate impact ratio is the empirical, ratio-form version of demographic parity, and the four-fifths rule is a common threshold on that ratio.

Demographic parity is straightforward to compute and easy to interpret, but it has a significant limitation. It ignores the true label Y entirely, meaning it can be satisfied by a classifier that performs poorly on all groups as long as it assigns positive predictions at equal rates. It can also conflict with accuracy when base rates (the prevalence of the positive outcome) differ between groups.
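
As a rough illustration of these definitions, the sketch below computes per-group selection rates, the demographic parity difference, and the disparate impact ratio. The function names and the toy predictions are assumptions made for this example and do not come from any particular library.

```python
from collections import defaultdict

def selection_rates(y_pred, groups):
    """Return {group: fraction of individuals with a positive prediction}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, g in zip(y_pred, groups):
        totals[g] += 1
        positives[g] += pred
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(rates):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    return max(rates.values()) - min(rates.values())

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest selection rate."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else float("nan")

# Hypothetical toy data: ten applicants in each of two groups.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
groups = ["a"] * 10 + ["b"] * 10

rates = selection_rates(y_pred, groups)
dp_diff = demographic_parity_difference(rates)
di_ratio = disparate_impact_ratio(rates)
passes_four_fifths = di_ratio >= 0.8  # four-fifths rule as a screening threshold

print(rates, dp_diff, di_ratio, passes_four_fifths)
```

In this toy data group `a` is selected at 60% and group `b` at 30%, so the ratio is 0.5 and falls below the four-fifths threshold. Note that the true labels never enter the calculation.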

### Equalized odds

Equalized odds requires that the model's [true positive rate](/wiki/recall) (TPR) and [false positive rate](/wiki/false_positive_rate_fpr) (FPR) are equal across groups. Formally:

**P(h(X) = 1 | Y = y, A = a) = P(h(X) = 1 | Y = y, A = b)**

for y in {0, 1} and all values a, b of the protected attribute A.

This criterion corresponds to the separation condition: the prediction R is independent of A given the true outcome Y. Unlike demographic parity, equalized odds conditions on the true label, so it requires that the model make errors at equal rates across groups; in general, neither criterion implies the other.[2] Hardt, Price, and Srebro, who introduced equalized odds in 2016, argued that it "enforces that the false positive rate and the true positive rate are the same across all groups," so that the predictor is independent of the protected attribute conditional on the true label.[2]

### Equal opportunity

Equal opportunity is a relaxation of equalized odds proposed by Hardt, Price, and Srebro (2016).[2] It requires only that the true positive rate be equal across groups:

**P(h(X) = 1 | Y = 1, A = a) = P(h(X) = 1 | Y = 1, A = b)**

This means that among individuals who actually deserve the positive outcome, each group should have the same chance of receiving it. Equal opportunity ignores the false positive rate, making it a less restrictive requirement than equalized odds. Because it conditions on the true label, it neither implies nor is implied by demographic parity.
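
The difference between equalized odds and equal opportunity can be seen on a small example. The following is a minimal sketch on made-up labels, with hypothetical helper names; it assumes every group contains at least one positive and one negative example, since the rates are otherwise undefined.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: {"tpr": ..., "fpr": ...}} for binary labels and predictions."""
    counts = {}
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if t == 1:
            c["tp" if p == 1 else "fn"] += 1
        else:
            c["fp" if p == 1 else "tn"] += 1
    return {
        g: {
            "tpr": c["tp"] / (c["tp"] + c["fn"]),
            "fpr": c["fp"] / (c["fp"] + c["tn"]),
        }
        for g, c in counts.items()
    }

def gap(rates, key):
    """Largest between-group difference for one rate ("tpr" or "fpr")."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Hypothetical toy data: 4 true positives and 4 true negatives per group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 1, 0, 1, 1, 0, 0]
groups = ["a"] * 8 + ["b"] * 8

rates = group_rates(y_true, y_pred, groups)
equal_opportunity_gap = gap(rates, "tpr")
equalized_odds_gap = max(gap(rates, "tpr"), gap(rates, "fpr"))

print(rates, equal_opportunity_gap, equalized_odds_gap)
```

Both groups have a true positive rate of 0.75, so equal opportunity holds, but the false positive rates are 0.25 and 0.5, so equalized odds does not.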

### Predictive parity

Predictive parity requires that the [positive predictive value](/wiki/precision) (PPV, or precision) be the same across groups:

**P(Y = 1 | h(X) = 1, A = a) = P(Y = 1 | h(X) = 1, A = b)**

In words, among individuals who receive a positive prediction, the proportion who truly belong to the positive class should be equal for all groups. This is the metric that Northpointe claimed COMPAS satisfied in response to ProPublica's critique.[4]

Predictive parity belongs to the sufficiency family: the true outcome Y is independent of A given the prediction R.
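
A short sketch of the predictive parity check follows. The helper names and data are hypothetical; the example is built so that PPV is equal across groups even though the selection rates are not.

```python
def positive_predictive_value(y_true, y_pred):
    """Fraction of positive predictions whose true label is positive."""
    flagged = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not flagged:
        return float("nan")  # PPV is undefined when nothing is predicted positive
    return sum(flagged) / len(flagged)

def ppv_by_group(y_true, y_pred, groups):
    """Return {group: PPV} computed separately within each group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        result[g] = positive_predictive_value(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return result

# Hypothetical toy data: ten individuals per group.
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
groups = ["a"] * 10 + ["b"] * 10

ppv = ppv_by_group(y_true, y_pred, groups)
print(ppv)
```

Group `a` has 3 correct out of 4 positive predictions and group `b` has 6 out of 8, so both have a PPV of 0.75 and predictive parity holds, while the selection rates (40% and 80%) violate demographic parity.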

### Calibration (test fairness)

Calibration, sometimes called test fairness or well-calibration, extends predictive parity to risk scores rather than binary predictions. A model is calibrated across groups if, for any predicted probability score s:

**P(Y = 1 | S = s, A = a) = P(Y = 1 | S = s, A = b)**

where S is the probability score output by the model. In a well-calibrated model, a predicted probability of 0.7 should correspond to an actual positive rate of approximately 70% regardless of group membership. Calibration is a sufficiency-based criterion.
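
One way to inspect calibration across groups is to bin the scores and compare the observed positive rate per bin and group. The sketch below is only illustrative: the bin count, helper names, and data are assumptions, scores are assumed to lie in [0, 1], and real audits would also need to account for sampling noise in small bins.

```python
from collections import defaultdict

def calibration_table(scores, y_true, groups, n_bins=5):
    """Return {(group, bin): (mean_score, observed_positive_rate, count)}."""
    buckets = defaultdict(list)
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        buckets[(g, b)].append((s, y))
    table = {}
    for key, items in buckets.items():
        n = len(items)
        mean_score = sum(s for s, _ in items) / n
        observed = sum(y for _, y in items) / n
        table[key] = (mean_score, observed, n)
    return table

def max_calibration_gap(table):
    """Largest between-group difference in observed rate within any score bin."""
    by_bin = defaultdict(list)
    for (_group, bin_index), (_mean, observed, _n) in table.items():
        by_bin[bin_index].append(observed)
    return max(max(v) - min(v) for v in by_bin.values())

# Hypothetical toy data: everyone receives a score of 0.7.
scores = [0.7] * 20
y_true = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6
groups = ["a"] * 10 + ["b"] * 10

table = calibration_table(scores, y_true, groups)
cal_gap = max_calibration_gap(table)
print(table, cal_gap)
```

Here a score of 0.7 corresponds to a 70% positive rate in group `a` but only 40% in group `b`, so the score is calibrated for one group and not the other.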

### Additional group fairness metrics

Several other group fairness metrics appear in the literature:[7]

| Metric | Definition | Condition equalized across groups |
|---|---|---|
| False positive rate balance | P(h(X)=1 \| Y=0, A=a) = P(h(X)=1 \| Y=0, A=b) | False positive rate |
| False negative rate balance | P(h(X)=0 \| Y=1, A=a) = P(h(X)=0 \| Y=1, A=b) | [False negative rate](/wiki/false_negative_rate) |
| Overall accuracy equality | Accuracy(A=a) = Accuracy(A=b) | Overall [accuracy](/wiki/accuracy) |
| Treatment equality | FN/FP ratio for group a = FN/FP ratio for group b | Ratio of false negatives to false positives |
| Conditional use accuracy equality | PPV(A=a) = PPV(A=b) and NPV(A=a) = NPV(A=b) | Both positive and negative predictive values |
| Balance for the positive class | E[S \| Y=1, A=a] = E[S \| Y=1, A=b] | Mean predicted score among positive instances |
| Balance for the negative class | E[S \| Y=0, A=a] = E[S \| Y=0, A=b] | Mean predicted score among negative instances |

## Individual fairness

While group fairness metrics compare aggregate statistics across demographic groups, individual fairness focuses on whether a model treats each individual appropriately relative to other similar individuals.

### Fairness through awareness

The concept of individual fairness was formalized by Dwork, Hardt, Pitassi, Reingold, and Zemel in their 2012 paper "Fairness Through Awareness." Their definition is grounded in a Lipschitz condition: a mapping M from individuals to outcome distributions satisfies individual fairness if, for any two individuals x and y:

**D(M(x), M(y)) <= d(x, y)**

where D is a distance metric on outcome distributions (such as total variation distance or statistical distance) and d is a task-specific similarity metric on individuals. The inequality states that if two individuals are close in the feature space according to d, they must receive similar outcome distributions according to D.[1]

The main advantage of individual fairness is that it provides guarantees at the individual level rather than only at the group level. However, it requires defining an appropriate task-specific similarity metric d, which can be difficult in practice and may itself encode biases. The choice of d is domain-dependent and often requires expert knowledge or stakeholder input.
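
The Lipschitz condition can be checked pairwise on a finite sample. The sketch below assumes a one-dimensional feature (for instance a normalized score in [0, 1]), takes d(x, y) = |x - y| as the similarity metric, and uses total variation distance for D; all of these choices, and the two example mappings, are assumptions for illustration only.

```python
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def lipschitz_violations(individuals, mapping, d, tol=1e-12):
    """Return index pairs (i, j) where D(M(x), M(y)) exceeds d(x, y)."""
    violations = []
    for i, j in combinations(range(len(individuals)), 2):
        x, y = individuals[i], individuals[j]
        if total_variation(mapping(x), mapping(y)) > d(x, y) + tol:
            violations.append((i, j))
    return violations

def d(x, y):
    """Assumed task-specific similarity metric on individuals."""
    return abs(x - y)

def smooth_mapping(x):
    """Outcome distribution (P(accept), P(reject)) that varies smoothly with x."""
    return (x, 1.0 - x)

def threshold_mapping(x):
    """Deterministic accept/reject rule with a hard cutoff at 0.5."""
    return (1.0, 0.0) if x >= 0.5 else (0.0, 1.0)

individuals = [0.2, 0.49, 0.51, 0.9]
smooth_violations = lipschitz_violations(individuals, smooth_mapping, d)
threshold_violations = lipschitz_violations(individuals, threshold_mapping, d)
print(smooth_violations, threshold_violations)
```

The smooth mapping satisfies the condition for every pair, whereas the hard threshold treats the individuals at 0.49 and 0.51 completely differently even though d rates them as nearly identical.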

### Fairness through unawareness

A simpler and weaker notion is fairness through unawareness (FTU), which simply excludes protected attributes from the model's input features. While intuitive, FTU is widely considered insufficient because other features (such as zip code or surname) may serve as proxies for the protected attribute, allowing the model to reconstruct group membership indirectly. This phenomenon is known as redundant encoding.
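
A quick way to see the proxy problem is to measure how well a retained feature predicts the dropped attribute. The sketch below uses made-up zip codes and group labels and a deliberately simple majority-vote rule; it is an illustration of redundant encoding, not a recommended proxy-detection method.

```python
from collections import Counter, defaultdict

def proxy_accuracy(proxy_values, groups):
    """Accuracy of guessing each person's group as the majority group
    among people who share the same proxy value."""
    by_value = defaultdict(Counter)
    for v, g in zip(proxy_values, groups):
        by_value[v][g] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(groups)

def majority_baseline(groups):
    """Accuracy of always guessing the overall most common group."""
    return Counter(groups).most_common(1)[0][1] / len(groups)

# Hypothetical data: two zip codes with different group compositions.
zip_codes = ["10001"] * 10 + ["10002"] * 10
groups = ["a"] * 9 + ["b"] * 1 + ["a"] * 2 + ["b"] * 8

acc = proxy_accuracy(zip_codes, groups)
baseline = majority_baseline(groups)
print(acc, baseline)
```

In this toy data the zip code alone recovers group membership for 85% of individuals, compared with 55% for a guess that ignores it, so a model trained without the protected attribute can still condition on it indirectly.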

## Causal fairness

### Counterfactual fairness

Counterfactual fairness, introduced by Kusner, Loftus, Russell, and Silva (2017), uses [causal inference](/wiki/causal_inference) to define fairness. A decision is counterfactually fair toward an individual if the decision would remain the same in a counterfactual world where the individual's protected attribute had been different, with all other non-descendant variables held at their observed values.[5]

Formally, a predictor h is counterfactually fair if:

**P(h(X)_{A <- a} = 1 | A = a, X = x) = P(h(X)_{A <- b} = 1 | A = a, X = x)**

for all attribute values a and b. Here, h(X)_{A <- a} denotes the prediction that would have been made if the protected attribute had been set to value a through intervention.

Counterfactual fairness requires constructing a causal model (a [directed acyclic graph](/wiki/directed_acyclic_graph)) that specifies the causal relationships between the protected attribute, other features, and the outcome. In practice, the predictor achieves counterfactual fairness by using only features that are non-descendants of the protected attribute in the causal graph.[5]
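
The idea can be sketched with a toy structural causal model. The model below is an assumption made purely for illustration: a binary attribute A, a latent variable U that is not caused by A, and a single feature X = U + 2A. Because the model is deterministic and additive, U can be recovered exactly from an observation, so the probabilistic definition reduces to comparing point predictions.

```python
EFFECT_OF_A = 2.0  # assumed causal effect of A on X in this toy model

def abduct_u(x, a):
    """Recover the latent U from an observed (x, a) under X = U + EFFECT_OF_A * A."""
    return x - EFFECT_OF_A * a

def counterfactual(x, a, a_new):
    """Observation the same individual would have had with A set to a_new."""
    return abduct_u(x, a) + EFFECT_OF_A * a_new, a_new

def predict_from_x(x, a):
    """Predictor that thresholds X, which is a descendant of A."""
    return int(x >= 1.5)

def predict_from_u(x, a):
    """Predictor that uses only the non-descendant latent U."""
    return int(abduct_u(x, a) >= 0.5)

def is_counterfactually_fair(predictor, observations):
    """Check that flipping A (and propagating its effect) never changes the prediction."""
    for x, a in observations:
        x_cf, a_cf = counterfactual(x, a, 1 - a)
        if predictor(x, a) != predictor(x_cf, a_cf):
            return False
    return True

# Hypothetical observations as (x, a) pairs.
observations = [(1.0, 0), (2.6, 1), (0.2, 0), (3.0, 1)]
fair_x = is_counterfactually_fair(predict_from_x, observations)
fair_u = is_counterfactually_fair(predict_from_u, observations)
print(fair_x, fair_u)
```

The predictor built on X changes its decision when A is flipped, while the predictor built on U does not. In realistic settings U is not exactly recoverable and must be inferred under the assumed causal model.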

The causal approach addresses a key limitation of purely statistical fairness metrics: it distinguishes between legitimate and illegitimate uses of information that correlates with group membership. For example, in a hiring scenario, education level may correlate with race due to historical inequities, but whether this correlation constitutes unfairness depends on whether education is causally downstream of race in a way that the decision-maker considers illegitimate.

## Subgroup and intersectional fairness

Traditional group fairness metrics evaluate fairness with respect to a single protected attribute (e.g., race or gender). However, individuals belong to multiple demographic groups simultaneously, and a model that appears fair for each attribute in isolation may be unfair for intersectional subgroups (e.g., Black women).

Kearns, Neel, Roth, and Wu (2018) formalized this problem in their paper "Preventing Fairness Gerrymandering." They showed that a classifier can satisfy a fairness constraint for each protected group individually while violating it for structured subgroups formed by combinations of protected attributes. The authors proved that auditing subgroup fairness is computationally equivalent to the problem of weak agnostic learning, making it hard in the worst case even for simple subgroup classes.[6]

Intersectional fairness metrics extend standard group fairness definitions (such as demographic parity or equalized odds) to hold across a rich collection of subgroups rather than just the groups defined by a single attribute.
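
The gerrymandering effect is easy to reproduce on synthetic data. In the sketch below, the records, attribute values, and helper names are all hypothetical; the data are constructed so that selection rates match for each attribute taken alone but differ sharply across intersections.

```python
from collections import defaultdict

def selection_rates_by(records, keys):
    """Selection rate for every combination of values of the given attributes."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        k = tuple(r[key] for key in keys)
        totals[k] += 1
        positives[k] += r["pred"]
    return {k: positives[k] / totals[k] for k in totals}

def make_records(race, gender, n_selected, n_total=10):
    """Build n_total synthetic records, the first n_selected of them selected."""
    return [
        {"race": race, "gender": gender, "pred": int(i < n_selected)}
        for i in range(n_total)
    ]

records = (
    make_records("Black", "woman", 2)
    + make_records("Black", "man", 8)
    + make_records("white", "woman", 8)
    + make_records("white", "man", 2)
)

by_race = selection_rates_by(records, ["race"])
by_gender = selection_rates_by(records, ["gender"])
by_both = selection_rates_by(records, ["race", "gender"])
print(by_race, by_gender, by_both)
```

Every single-attribute group has a selection rate of 50%, so demographic parity holds for race and for gender separately, yet the intersectional subgroups range from 20% to 80%.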

## Why can't all fairness metrics be satisfied at once?

One of the most significant theoretical results in the fairness metrics literature is that several commonly desired fairness properties cannot be satisfied simultaneously, except in trivial or degenerate cases. This is the core reason there is no universally "correct" fairness metric: the metrics encode genuinely different goals that pull against one another whenever the groups being compared have different base rates.

### Chouldechova's impossibility result (2017)

In "Fair Prediction with Disparate Impact," Alexandra Chouldechova proved that when the base rate (prevalence of the positive outcome) differs between two groups, no imperfect classifier can simultaneously satisfy:

1. Predictive parity (equal PPV across groups)
2. False positive rate balance (equal FPR across groups)
3. False negative rate balance (equal FNR across groups)

Chouldechova showed that error-rate balance and predictive parity "cannot all be simultaneously satisfied when the recidivism prevalence differs across groups."[4] The intuition is that a classifier's PPV, false positive rate, and false negative rate are tied together by the base rate, so when one group has a higher base rate, holding PPV equal forces at least one of the two error rates to differ across groups. The only exceptions are a perfect classifier (which makes no errors) and the case where both groups have identical base rates.[4]
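
The result follows from an identity that links the confusion-matrix rates within a group: FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), where p is the base rate. The sketch below checks the identity on an arbitrary confusion matrix and then shows what it implies for two groups; the numbers are hypothetical.

```python
def implied_fpr(prevalence, ppv, fnr):
    """False positive rate implied by the base rate, PPV, and FNR of a group."""
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)

def rates_from_counts(tp, fn, fp, tn):
    """Derive base rate, PPV, FNR, and FPR from confusion-matrix counts."""
    n = tp + fn + fp + tn
    return {
        "prevalence": (tp + fn) / n,
        "ppv": tp / (tp + fp),
        "fnr": fn / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Sanity check of the identity on one hypothetical confusion matrix.
check = rates_from_counts(tp=30, fn=20, fp=20, tn=30)
identity_holds = abs(
    implied_fpr(check["prevalence"], check["ppv"], check["fnr"]) - check["fpr"]
) < 1e-12

# Two groups with the same PPV and FNR but different base rates.
fpr_a = implied_fpr(prevalence=0.5, ppv=0.6, fnr=0.3)
fpr_b = implied_fpr(prevalence=0.3, ppv=0.6, fnr=0.3)
print(identity_holds, fpr_a, fpr_b)
```

With PPV and FNR held equal, the group with the higher base rate ends up with a false positive rate of about 0.47 against 0.20 for the other group, so all three conditions cannot hold at once unless the base rates match.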

### Kleinberg, Mullainathan, and Raghavan's impossibility result (2016)

In "Inherent Trade-Offs in the Fair Determination of Risk Scores," Kleinberg, Mullainathan, and Raghavan formalized three conditions for risk score assignments:

1. **Calibration within groups:** For each group and each score bin, the expected fraction of individuals who belong to the positive class equals the assigned score.
2. **Balance for the positive class:** The average score assigned to positive-class individuals is the same across groups.
3. **Balance for the negative class:** The average score assigned to negative-class individuals is the same across groups.

They proved that "except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously," and added that even satisfying all three approximately requires the data to lie in an approximate version of those special cases (a perfect classifier or equal base rates across groups).[3]
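
A simple case makes the tension concrete. Suppose every individual is assigned a score equal to the base rate of their group: the score is then calibrated within each group by construction, but the balance conditions fail whenever base rates differ. The sketch below uses hypothetical data and helper names, and covers only this one uninformative scoring rule rather than the general theorem.

```python
def mean(values):
    return sum(values) / len(values)

def base_rates(y_true, groups):
    """Fraction of positive labels within each group."""
    return {
        g: mean([y for y, gi in zip(y_true, groups) if gi == g])
        for g in sorted(set(groups))
    }

def class_balance(scores, y_true, groups, label):
    """Mean score among individuals with the given true label, per group."""
    return {
        g: mean([s for s, y, gi in zip(scores, y_true, groups) if gi == g and y == label])
        for g in sorted(set(groups))
    }

# Hypothetical data: base rate 0.6 in group a and 0.3 in group b.
groups = ["a"] * 10 + ["b"] * 10
y_true = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7

rates = base_rates(y_true, groups)
scores = [rates[g] for g in groups]  # calibrated but uninformative score

positive_balance = class_balance(scores, y_true, groups, label=1)
negative_balance = class_balance(scores, y_true, groups, label=0)
print(rates, positive_balance, negative_balance)
```

Positive-class members of group `a` receive an average score of 0.6 and those of group `b` receive 0.3, and the same gap appears for the negative class, so calibration holds while both balance conditions are violated.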

### Practical implications

The impossibility theorems do not mean that pursuing fairness is futile. Instead, they establish that fairness is an inherently normative concept requiring practitioners to make explicit choices about which fairness criteria to prioritize. The appropriate choice depends on the specific application, the types of harm at stake, and the values of the affected communities. In criminal justice, for example, one might prioritize equal false positive rates (to avoid disproportionately punishing one group) over predictive parity. In lending, predictive parity might be preferred to ensure that approved applicants from all groups have similar repayment rates.

The following table summarizes the key impossibility results:

| Result | Year | Conditions shown to be incompatible | Exception cases |
|---|---|---|---|
| Chouldechova | 2017 | Predictive parity + FPR balance + FNR balance | Perfect classifier; equal base rates |
| Kleinberg, Mullainathan, Raghavan | 2016 | Calibration + balance for positive class + balance for negative class | Perfect classifier; equal base rates |
| General incompatibility | Various | Independence (demographic parity) + separation (equalized odds) + sufficiency (calibration) | Degenerate cases such as equal base rates or an uninformative predictor; otherwise even two of the three generally cannot hold together |

## How do fairness and accuracy trade off?

Enforcing fairness constraints typically comes at some cost to overall predictive accuracy. This trade-off arises because fairness constraints restrict the space of allowable classifiers, potentially excluding the most accurate model.

The magnitude of the trade-off depends on several factors:

- **Base rate differences:** When the positive class is more prevalent in one group, achieving equalized odds may require the model to sacrifice accuracy for one or both groups.
- **Feature informativeness:** If the protected attribute or its proxies carry predictive information about the outcome, removing this information reduces accuracy.
- **Sample size:** Smaller groups may have noisier estimates, making it harder to satisfy fairness constraints without sacrificing accuracy.

Researchers have studied the Pareto frontier of fairness-accuracy trade-offs, characterizing the best achievable accuracy for a given level of fairness and vice versa. In some settings, the trade-off is modest, and fair classifiers achieve accuracy close to unconstrained models. In other settings, especially when base rates differ substantially, the trade-off can be significant.

## Bias mitigation strategies

Fairness metrics are typically used in conjunction with bias mitigation algorithms, which can be categorized by where they intervene in the machine learning pipeline.[9]

### Pre-processing methods

Pre-processing methods modify the training data before it is fed to the learning algorithm. The goal is to remove or reduce correlations between the protected attribute and the features or labels.

| Method | Description |
|---|---|
| Reweighting (also spelled reweighing) | Assigns sample weights to balance the representation of different groups and outcomes. The weight for a sample in group a with label y is the ratio of the expected probability under independence to the observed probability: P(A = a) P(Y = y) / P(A = a, Y = y). |
| Disparate impact remover | Transforms feature values to reduce their correlation with the protected attribute while preserving their rank ordering within each group. |
| Learning fair representations | Learns a new feature representation that is independent of the protected attribute while retaining information about the outcome (Zemel et al., 2013). |
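
The reweighting rule in the table can be sketched directly from its definition: the weight for group a and label y is P(A = a) P(Y = y) / P(A = a, Y = y). The function names and data below are assumptions for illustration and are not taken from any toolkit.

```python
from collections import Counter

def reweighting_weights(groups, labels):
    """Return {(group, label): weight} with weight = P(A)P(Y) / P(A, Y)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
        for (g, y) in joint_counts
    }

def weighted_positive_rate(groups, labels, weights, group):
    """Weighted fraction of positive labels within one group."""
    pairs = [(g, y) for g, y in zip(groups, labels) if g == group]
    numerator = sum(weights[(g, y)] * y for g, y in pairs)
    denominator = sum(weights[(g, y)] for g, y in pairs)
    return numerator / denominator

# Hypothetical training labels: 6 of 10 positive in group a, 2 of 10 in group b.
groups = ["a"] * 10 + ["b"] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

weights = reweighting_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(weights, rate_a, rate_b)
```

After weighting, both groups have a weighted positive rate of 0.4, the overall base rate, so the label is no longer associated with the protected attribute in the weighted training set.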

### In-processing methods

In-processing methods incorporate fairness constraints directly into the model's training objective.

| Method | Description |
|---|---|
| Adversarial debiasing | Trains a predictor and an adversary simultaneously. The predictor tries to predict the outcome, while the adversary tries to predict the protected attribute from the predictor's output. The predictor is penalized when the adversary succeeds. |
| Prejudice remover | Adds a regularization term to the [loss function](/wiki/loss_function) that penalizes dependence between the prediction and the protected attribute (Kamishima et al., 2012). |
| Constrained optimization | Formulates fairness as an explicit constraint in the optimization problem and uses methods from constrained optimization to find the best model that satisfies the constraint. |
| Meta-fair classifier | Uses a meta-learning approach to find classifiers that optimize a chosen fairness metric. |

### Post-processing methods

Post-processing methods adjust the model's predictions after training to improve fairness.

| Method | Description |
|---|---|
| Equalized odds post-processing | Finds group-specific thresholds or randomization probabilities that satisfy equalized odds while maximizing accuracy (Hardt et al., 2016). |
| Calibrated equalized odds | Modifies predictions to satisfy a relaxed version of equalized odds that preserves calibration as much as possible. |
| Reject option classification | Gives favorable outcomes to unprivileged groups and unfavorable outcomes to privileged groups in the region near the decision boundary where the model is least certain (Kamiran et al., 2012). |
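
Group-specific thresholds can be illustrated with a deliberately simplified sketch. It only equalizes the true positive rate (equal opportunity) using deterministic thresholds, whereas the method of Hardt et al. also constrains the false positive rate and may randomize between thresholds. The helper names, target rate, and scores are hypothetical.

```python
def tpr_at(scores, y_true, threshold):
    """True positive rate when predicting positive for scores >= threshold."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

def threshold_for_tpr(scores, y_true, target_tpr):
    """Highest observed score that, used as a threshold, reaches the target TPR."""
    for t in sorted(set(scores), reverse=True):
        if tpr_at(scores, y_true, t) >= target_tpr:
            return t
    return min(scores)

# Hypothetical scores and labels; group b is scored lower overall.
data = {
    "a": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2], [1, 1, 1, 1, 0, 0, 0, 0]),
    "b": ([0.7, 0.6, 0.5, 0.4, 0.45, 0.3, 0.2, 0.1], [1, 1, 1, 1, 0, 0, 0, 0]),
}

# A single shared threshold gives unequal true positive rates.
shared_tpr = {g: tpr_at(s, y, 0.7) for g, (s, y) in data.items()}

# Per-group thresholds chosen to reach the same target TPR.
group_thresholds = {g: threshold_for_tpr(s, y, 0.75) for g, (s, y) in data.items()}
adjusted_tpr = {g: tpr_at(s, y, group_thresholds[g]) for g, (s, y) in data.items()}
print(shared_tpr, group_thresholds, adjusted_tpr)
```

With a shared threshold of 0.7 the true positive rates are 0.75 and 0.25. Choosing thresholds of 0.7 for group `a` and 0.5 for group `b` brings both to 0.75.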

## Open-source tools and frameworks

Several software libraries implement fairness metrics and bias mitigation algorithms. IBM's AI Fairness 360 alone implements over 70 fairness metrics, illustrating how many distinct ways researchers have proposed to quantify fairness.[9]

| Tool | Developer | Language | Key features |
|---|---|---|---|
| [AI Fairness 360](https://aif360.res.ibm.com/) (AIF360) | IBM Research | Python, R | Over 70 fairness metrics and bias mitigation algorithms spanning pre-processing, in-processing, and post-processing |
| [Fairlearn](https://fairlearn.org/) | Microsoft | Python | Fairness assessment dashboard, mitigation algorithms including exponentiated gradient and threshold optimizer, regression support |
| [What-If Tool](https://pair-code.github.io/what-if-tool/) | Google PAIR | JavaScript/Python | Interactive visualization for exploring model behavior, supports TensorFlow and XGBoost models, Jupyter integration |
| Aequitas | University of Chicago | Python | Bias audit toolkit focused on group fairness metrics for classification models |
| [Responsible AI Toolbox](https://responsibleaitoolbox.ai/) | Microsoft | Python | Integrates fairness assessment with interpretability, error analysis, and causal inference |

## Regulatory and legal context

Fairness metrics have taken on increased importance as governments introduce regulations governing automated decision-making.

### United States

The EEOC applies existing anti-discrimination laws (Title VII, the Age Discrimination in Employment Act, and the Americans with Disabilities Act) to automated decision-making tools, including those powered by machine learning. In May 2023, the EEOC issued technical guidance reaffirming that employers using AI tools in hiring must ensure those tools do not produce disparate impact against protected groups. The four-fifths rule from the 1978 Uniform Guidelines remains a common initial screening tool, though researchers have noted that it is a rough heuristic and not a formal legal threshold for discrimination.[15]

### European Union

The EU AI Act, which entered into force on August 1, 2024, classifies AI systems used in employment, credit scoring, criminal justice, and other high-stakes domains as "high-risk." These systems must meet requirements for transparency, human oversight, and non-discrimination. Providers of high-risk AI systems must conduct conformity assessments and maintain documentation demonstrating that their systems do not produce discriminatory outcomes. The Act's obligations for high-risk systems embedded in regulated products become fully applicable by August 2027.[13]

### Implications for metric selection

Regulatory frameworks generally do not prescribe specific fairness metrics but instead require that systems avoid discriminatory outcomes. This leaves practitioners with the responsibility of choosing metrics appropriate to their context. The choice should be guided by the type of decision being made, the potential harms of errors, the presence of base rate differences, and consultation with affected communities.

## Limitations and open challenges

Despite significant progress, fairness metrics face several unresolved challenges:

- **Metric selection:** The impossibility theorems show that no single metric can capture all dimensions of fairness. Choosing a metric is a normative decision that should involve stakeholders, but in practice, metric selection is often made by engineers without input from affected communities.
- **Proxy discrimination:** Even when a model satisfies a chosen fairness metric with respect to known protected attributes, it may discriminate based on proxies or unobserved attributes.
- **Dynamic and feedback effects:** Fairness metrics typically evaluate a model at a single point in time, but deployed models interact with the world and can create feedback loops. For example, a predictive policing model may direct more police to certain neighborhoods, leading to more arrests in those neighborhoods, which in turn reinforces the model's predictions.
- **Intersectionality:** Standard group fairness metrics consider one protected attribute at a time. Intersectional fairness requires evaluating fairness across combinations of attributes, which increases the number of subgroups exponentially and can make metric satisfaction computationally difficult.
- **Defining similarity:** Individual fairness requires a task-specific similarity metric, but defining this metric is often the hardest part of the problem. The choice of similarity metric can itself embed value judgments and biases.
- **Measurement and data quality:** Fairness metrics are only as reliable as the data used to compute them. Protected attribute labels may be missing, self-reported, or imputed, and outcome labels may reflect historical biases.
- **Beyond classification:** Most fairness metrics have been developed for binary classification. Extending them to regression, ranking, recommendation systems, and [generative AI](/wiki/generative_model) remains an active area of research.

## Which fairness metric should you use?

There is no default answer, but a few practical heuristics follow from the theory above. If the cost of a false positive falls heavily on a protected group (as in criminal risk scoring), equalizing false positive rates or using equalized odds tends to be the priority. If the goal is that a positive prediction means the same thing for everyone (as in lending, where an approved applicant should repay at a similar rate regardless of group), predictive parity or calibration is the natural target. If the legal frame is disparate impact (as in U.S. hiring), the demographic parity family and the four-fifths rule apply most directly. Because the impossibility theorems guarantee that these goals conflict whenever base rates differ, teams should state which metric they are optimizing, document why, and consult the affected communities rather than silently defaulting to whichever metric a library reports first.

## Key papers and timeline

The following table lists influential papers that have shaped the fairness metrics field:

| Year | Authors | Contribution |
|---|---|---|
| 2012 | Dwork, Hardt, Pitassi, Reingold, Zemel | Introduced individual fairness via the Lipschitz condition ("Fairness Through Awareness") |
| 2016 | Hardt, Price, Srebro | Defined equalized odds and equal opportunity; proposed post-processing methods |
| 2016 | Kleinberg, Mullainathan, Raghavan | Proved impossibility of simultaneously satisfying calibration and balance conditions |
| 2016 | ProPublica (Angwin, Larson, Mattu, Kirchner) | Published "Machine Bias" analysis of COMPAS recidivism tool |
| 2017 | Chouldechova | Proved impossibility of satisfying predictive parity and error rate balance simultaneously |
| 2017 | Kusner, Loftus, Russell, Silva | Introduced counterfactual fairness using causal inference |
| 2018 | Kearns, Neel, Roth, Wu | Formalized subgroup fairness and fairness gerrymandering |
| 2018 | Verma and Rubin | Provided a comprehensive taxonomy of fairness definitions ("Fairness Definitions Explained") |
| 2018 | IBM Research | Released AI Fairness 360 open-source toolkit |
| 2020 | Fairlearn team (Microsoft) | Released Fairlearn open-source toolkit |

## See also

- [Bias](/wiki/bias)
- [Algorithmic fairness](/wiki/algorithmic_fairness)
- [Demographic parity](/wiki/demographic_parity)
- [Equalized odds](/wiki/equalized_odds)
- [Disparate impact](/wiki/disparate_impact)
- [Disparate treatment](/wiki/disparate_treatment)
- [Counterfactual fairness](/wiki/counterfactual_fairness)
- [Confusion matrix](/wiki/confusion_matrix)
- [Precision](/wiki/precision)
- [Recall](/wiki/recall)
- [Accuracy](/wiki/accuracy)
- [Calibration](/wiki/calibration)

## References

1. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). "Fairness Through Awareness." *Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS)*, 214-226.
2. Hardt, M., Price, E., & Srebro, N. (2016). "Equality of Opportunity in Supervised Learning." *Advances in Neural Information Processing Systems 29 (NeurIPS)*. https://home.ttic.edu/~nati/Publications/HardtPriceSrebro2016.pdf
3. Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent Trade-Offs in the Fair Determination of Risk Scores." *Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017)*. arXiv:1609.05807.
4. Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." *Big Data*, 5(2), 153-163. arXiv:1703.00056.
5. Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). "Counterfactual Fairness." *Advances in Neural Information Processing Systems 30 (NeurIPS)*, 4066-4076.
6. Kearns, M., Neel, S., Roth, A., & Wu, Z. S. (2018). "Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 2564-2572.
7. Verma, S., & Rubin, J. (2018). "Fairness Definitions Explained." *Proceedings of the International Workshop on Software Fairness (FairWare)*, 1-7.
8. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine Bias." *ProPublica*. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
9. Bellamy, R. K. E., et al. (2018). "AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias." *IBM Research*. https://aif360.res.ibm.com/
10. Bird, S., et al. (2020). "Fairlearn: A Toolkit for Assessing and Improving Fairness in AI." *Microsoft Research*. https://fairlearn.org/
11. Kamishima, T., Akaho, S., Asoh, H., & Sakuma, J. (2012). "Fairness-Aware Classifier with Prejudice Remover Regularizer." *Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)*, 35-50.
12. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., & Dwork, C. (2013). "Learning Fair Representations." *Proceedings of the 30th International Conference on Machine Learning (ICML)*, 325-333.
13. Regulation (EU) 2024/1689 of the European Parliament and of the Council (EU AI Act). *Official Journal of the European Union*, L 2024/1689.
14. Griggs v. Duke Power Co., 401 U.S. 424 (1971).
15. Uniform Guidelines on Employee Selection Procedures (1978), 29 CFR Part 1607, U.S. Equal Employment Opportunity Commission (with Department of Labor, Department of Justice, and Civil Service Commission); COMPAS Broward County false positive rates from Larson, J., Mattu, S., Kirchner, L., & Angwin, J. (2016), "How We Analyzed the COMPAS Recidivism Algorithm," *ProPublica*. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
