Counterfactual Fairness
Last reviewed
Jun 2, 2026
Sources
24 citations
Review status
Source-backed
Revision
v5 · 6,065 words
Counterfactual fairness is a formal definition of algorithmic fairness rooted in causal inference. Introduced by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva in their 2017 paper presented at the Conference on Neural Information Processing Systems (NeurIPS), the concept defines a prediction as fair toward an individual if the prediction would remain the same in a counterfactual world where that individual belonged to a different demographic group.[1] Unlike statistical fairness criteria such as demographic parity or equalized odds, counterfactual fairness explicitly models the causal mechanisms through which protected attributes influence outcomes, drawing on Judea Pearl's framework of structural causal models (SCMs).[2] At the time of publication the four authors were affiliated with the Alan Turing Institute, with Kusner also at the University of Warwick, Loftus at New York University, Russell also at the University of Surrey, and Silva also at University College London.[1]
In the vocabulary of Pearl's causal hierarchy (also called the ladder of causation), counterfactual fairness is a counterfactual notion: it lives on the third and highest rung, which concerns questions of the form "what would Y have been for this individual had X been different, given what we actually observed." This places it above purely associational criteria such as demographic parity, which sit on the first (observational) rung, and above interventional do-operator criteria on the second rung. Computing the quantities required by counterfactual fairness therefore demands a fully specified structural causal model rather than only an observational or experimental distribution.[2][6]
The framework has become one of the most widely studied approaches to individual-level fairness in machine learning, with applications in hiring, lending, criminal justice, and healthcare. It has also sparked active debate about the relationship between causal and statistical fairness definitions.
Imagine a teacher is picking students for the school spelling bee. The teacher should choose students based on how well they can spell, not based on whether they are a boy or a girl. Counterfactual fairness asks a simple question: if we could magically change one thing about a student (like whether they are a boy or a girl) but keep everything else the same (how much they practiced, how many words they know), would the teacher still make the same choice? If the answer is yes, the decision is fair. If the answer is no, then the decision is being influenced by something it should not depend on.
Machine learning systems are increasingly used to make high-stakes decisions that affect people's lives, including loan approvals, hiring decisions, bail and sentencing recommendations, and medical diagnoses. These systems learn from historical data that often reflects decades or centuries of systemic discrimination. A hiring model trained on past hiring decisions, for example, may learn to penalize applicants from underrepresented groups simply because those groups were historically hired at lower rates, regardless of their actual qualifications.
Early approaches to algorithmic fairness focused on statistical definitions. Demographic parity requires that the proportion of positive outcomes be equal across groups. Equalized odds, introduced by Hardt, Price, and Srebro in 2016, requires that a classifier have equal true positive and false positive rates across groups; the related equality of opportunity, from the same paper, relaxes this to equal true positive rates only.[4] Individual fairness, proposed by Dwork et al. in 2012, requires that similar individuals receive similar predictions, but leaves the definition of "similar" as a task-specific design choice.[3] A still simpler baseline, fairness through unawareness, holds that an algorithm is fair so long as the protected attribute is not used explicitly as an input.[1]
These definitions, while useful, do not capture the causal pathways through which protected attributes influence predictions. A model that achieves demographic parity might still rely on proxies for race or gender. Equalized odds conditions on the true outcome, which may itself be tainted by historical bias. Individual fairness requires a similarity metric that can be difficult to specify in practice. A further well-known result by Kleinberg, Mullainathan, and Raghavan (2017) shows that several of these statistical criteria, including calibration and balance for the positive and negative classes, cannot in general be satisfied simultaneously except in degenerate cases, which means there is no purely statistical definition that captures every fairness intuition at once.[15]
Kusner et al. argued that fairness is inherently a causal concept. To determine whether a decision is fair, one must ask: would this decision have been different if the individual had belonged to a different group? Answering this question requires reasoning about counterfactuals, which in turn requires a causal model of how variables relate to one another.[1] A closely related causal perspective was developed independently by Kilbertus et al. (2017) in "Avoiding Discrimination through Causal Reasoning," which framed discrimination in terms of causal paths from the protected attribute to the prediction and introduced the notions of resolving variables (variables through which influence of the protected attribute is deemed acceptable) and proxy discrimination (influence that reaches the prediction along a causal path passing through a proxy for the protected attribute; in the paper's terminology, the path is "blocked" by the proxy).[13]
Counterfactual fairness is defined within the framework of structural causal models (SCMs), a formalism developed by Judea Pearl.[2] An SCM is a mathematical object that represents the causal relationships among a set of variables. Following Pearl, Kusner et al. define a causal model as a triple (U, V, F) consisting of three components.[1]
Exogenous variables (U) are variables whose values are determined by factors outside the model. They represent unobserved background conditions, individual characteristics, or sources of randomness. In the fairness setting, exogenous variables often represent innate traits or abilities that are independent of a person's membership in a protected group.
Endogenous variables (V) are variables whose values are determined by other variables in the model through structural equations. In the fairness context, endogenous variables include the protected attribute A (such as race or gender), observed features X (such as test scores or work experience), and the outcome Y (such as whether someone is hired).
Structural equations (F) define how each endogenous variable is determined by its parents in the causal graph and by relevant exogenous variables. Each equation takes the form V_i = f_i(pa_i, U_pa_i), where pa_i denotes the direct causes (parents) of V_i in the causal graph.
The structural equations induce a directed acyclic graph (DAG) that represents the causal relationships among variables. Arrows in the graph point from causes to effects. The DAG makes explicit which variables are causally affected by the protected attribute, either directly or through intermediate variables.
A central feature of SCMs is their ability to answer counterfactual questions. Counterfactual reasoning in the Pearl framework follows a three-step procedure known as abduction-action-prediction.[14]
| Step | Name | Description |
|---|---|---|
| 1 | Abduction | Given the observed evidence (the actual values of all variables for an individual), update the probability distribution over the exogenous variables U. This step infers what the individual's background characteristics must be, given what we observe about them. |
| 2 | Action | Modify the structural equations by intervening on the variable of interest. In the fairness setting, this means setting the protected attribute A to a different value (for example, changing race from one group to another). |
| 3 | Prediction | Using the modified model with the updated exogenous variables, compute the values of all other variables under the intervention. This produces the counterfactual outcome: what would have happened to this specific individual if their protected attribute had been different. |
This three-step procedure distinguishes counterfactual queries from interventional queries. An intervention asks what would happen to a randomly selected individual if we changed their group membership. A counterfactual asks what would have happened to a specific observed individual, taking into account everything we know about them.
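The three steps can be made concrete with a deliberately tiny example. The sketch below assumes a hypothetical one-equation additive-noise model, X = ALPHA * A + U, in which the structural equation is invertible, so abduction recovers a single value of U rather than a posterior distribution. The coefficient `ALPHA` and the function names are illustrative assumptions, not part of the original paper.

```python
# Toy additive-noise SCM (hypothetical, for illustration only):
#   X = ALPHA * A + U, where U is an individual-specific background term.
ALPHA = 2.0  # assumed causal effect of A on X


def abduct(x_obs, a_obs):
    """Step 1 (abduction): infer U from the observed pair (x, a)."""
    return x_obs - ALPHA * a_obs


def act_and_predict(u, a_new):
    """Steps 2 and 3 (action, prediction): set A := a_new and recompute X."""
    return ALPHA * a_new + u


def counterfactual_x(x_obs, a_obs, a_new):
    """Counterfactual value of X for one observed individual."""
    u = abduct(x_obs, a_obs)
    return act_and_predict(u, a_new)


x_cf = counterfactual_x(x_obs=5.0, a_obs=1, a_new=0)
print(x_cf)  # 3.0
```

In models where the equations are not invertible, or where some parents are unobserved, the abduction step yields a distribution over U and the counterfactual is a distribution as well.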
The formal definition of counterfactual fairness, stated as Definition 5 by Kusner et al. (2017), is as follows.[1]
A predictor Y-hat is counterfactually fair if, for any individual with observed features X = x and protected attribute A = a:
P(Y-hat_{A <- a}(U) = y | X = x, A = a) = P(Y-hat_{A <- a'}(U) = y | X = x, A = a)
for all y and for any value a' that A could take.
In plain language, this definition states that the distribution of predictions for an individual must be identical whether we consider the actual world (where A = a) or a counterfactual world (where A = a') while holding fixed the individual's background characteristics U. The conditioning on X = x and A = a reflects our knowledge of the individual from the observed data. The subscript A <- a' denotes the counterfactual operation of setting A to a'.
The definition operates at the individual level, not the group level. It does not require that aggregate statistics (such as acceptance rates) be equal across groups. Instead, it requires that each individual would receive the same prediction regardless of which group they belong to, after accounting for the causal structure of the data.
Kusner et al. deliberately chose a counterfactual criterion over a weaker interventional one. An interventional constraint such as P(Y-hat = 1 | do(A = a)) = P(Y-hat = 1 | do(A = a')) only equalizes an average effect across the population. The authors argue that such a constraint "provides no guarantees against, for example, having half of the individuals being strongly negatively discriminated and half of the individuals strongly positively discriminated," because the two halves can cancel out in the average. The counterfactual definition rules out this kind of hidden balancing by requiring invariance for each individual.[1] The paper also notes that for the definition to behave sensibly, the set of protected attributes A should be closed under ancestral relationships in the causal graph: if race is protected, then a parent such as mother's race should be protected too.[1]
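The cancellation argument can be checked on a small made-up population. In the sketch below, the predictor `y_hat` and the four background values are hypothetical; the predictor is constructed so that changing A flips every individual's prediction, yet the positive rate under do(A = 0) and do(A = 1) is identical.

```python
def y_hat(a, u):
    """Hypothetical predictor whose sign flips with A for every individual."""
    return u if a == 1 else -u


# Background variable U for four individuals, assumed independent of A.
population_u = [1, 1, -1, -1]


def positive_rate(a):
    """P(Y_hat = 1 | do(A = a)) over the toy population."""
    return sum(y_hat(a, u) == 1 for u in population_u) / len(population_u)


# Individual-level check: does anyone's prediction change when A changes?
individually_invariant = all(y_hat(0, u) == y_hat(1, u) for u in population_u)

print(positive_rate(0), positive_rate(1))  # 0.5 0.5
print(individually_invariant)              # False
```

The interventional constraint is satisfied exactly, while the counterfactual criterion fails for all four individuals.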
Kusner et al. proved a result that simplifies the implementation of counterfactual fairness in practice.[1]
Lemma 1: Let G be the causal graph of the model (U, V, F). Then a predictor Y-hat is counterfactually fair if it is a function only of the non-descendants of the protected attribute A in the graph.
This lemma provides a practical criterion: if the predictor uses only variables that are not causally affected by the protected attribute (either directly or indirectly), then it is automatically counterfactually fair. Variables that are descendants of A in the causal graph carry information that is causally influenced by group membership. Using such variables in a predictor can transmit the effect of the protected attribute into the prediction, even if the protected attribute is not used directly. The paper stresses that this does not strictly forbid using a descendant of A; it only requires that the overall dependence of the prediction on A vanish, which "will not happen in general," so restricting to non-descendants is simply the most direct way to satisfy the criterion.[1]
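Applying the lemma amounts to a graph traversal once a causal graph has been agreed on. The following is a minimal sketch using a hypothetical DAG stored as a parent-to-children dictionary; the node names are invented for illustration, and the graph itself is an assumption that would have to come from domain knowledge.

```python
def descendants(graph, node):
    """Return every node reachable from `node` by following directed edges."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, []))
    return seen


# Hypothetical causal graph: keys are parents, values are their children.
graph = {
    "A": ["X1", "X2"],
    "X1": ["Y"],
    "X2": ["X3"],
    "U": ["X1", "Z"],
    "Z": ["Y"],
}

all_nodes = set(graph) | {child for children in graph.values() for child in children}
non_descendants = all_nodes - descendants(graph, "A") - {"A"}

print(sorted(descendants(graph, "A")))  # ['X1', 'X2', 'X3', 'Y']
print(sorted(non_descendants))          # ['U', 'Z']
```

In this toy graph a predictor built only on Z (and on U, if it can be inferred) would satisfy the lemma's sufficient condition, whereas X1, X2, and X3 would not.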
Kusner et al. described three levels at which counterfactual fairness can be implemented, with each level requiring progressively stronger causal assumptions but also allowing the use of more information for prediction.[1] Their accompanying FairLearning algorithm fits a predictor on the inferred background variables, using Markov chain Monte Carlo (implemented with the probabilistic programming language Stan) to draw samples of the latent variables when the relevant posterior cannot be computed in closed form.[1]
| Level | Approach | Causal assumptions | Information used | Practical notes |
|---|---|---|---|---|
| Level 1 | Use only observable non-descendants of A | Minimal: only requires knowledge of which observed variables are not descendants of A | Only observed variables that are causally unaffected by A | Simple to implement, but in many real-world problems most observed variables are descendants of protected attributes, leaving few usable features |
| Level 2 | Infer latent variables U from observed data using domain knowledge | Moderate: requires specifying a causal model with latent variables and learning conditional distributions P(U given X, A) | Latent "fair" variables inferred from the data, which capture individual characteristics independent of group membership | Requires explicit domain knowledge about the causal structure; latent variables act as debiased versions of observed features |
| Level 3 | Specify fully deterministic structural equations with additive error terms | Strong: requires a complete specification of structural equations, typically as additive noise models V_i = f_i(pa_i) + e_i | Error terms (residuals) that are independent of A by construction, capturing individual-specific variation not attributable to group membership | Maximizes the amount of information available for prediction; the error terms serve as counterfactually fair inputs to the predictor |
Level 3 extracts the most information from the data while maintaining counterfactual fairness, but it depends on the correctness of the assumed structural equations. If the causal model is misspecified, the resulting predictor may not be truly fair.
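A rough sketch of the Level 3 idea on synthetic data follows. It is not the authors' FairLearning implementation: the data-generating coefficients are invented, there is a single feature, and the protected attribute is binary so that the least-squares fit of X on A reduces to group means. The predictor is fitted on the residuals of X only, and the final check confirms that, under the fitted equation, an individual's prediction is unchanged when A is counterfactually altered.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic data from a hypothetical additive-noise model (all numbers assumed):
#   X = 1.0 + 2.0 * A + eps,   Y = 1.0 * A + 0.5 * eps + noise
rows = []
for _ in range(1000):
    a = random.randint(0, 1)
    eps = random.gauss(0.0, 1.0)
    x = 1.0 + 2.0 * a + eps
    y = 1.0 * a + 0.5 * eps + random.gauss(0.0, 0.1)
    rows.append((a, x, y))

# Structural equation for X: with a binary A, least squares reduces to group means.
group_mean = {g: mean(x for a, x, _ in rows if a == g) for g in (0, 1)}


def residual(a, x):
    """Estimated error term of X, used as the Level 3 input."""
    return x - group_mean[a]


# Fit y_hat = c0 + c1 * residual by one-variable least squares.
res = [residual(a, x) for a, x, _ in rows]
ys = [y for _, _, y in rows]
r_bar, y_bar = mean(res), mean(ys)
c1 = sum((r - r_bar) * (y - y_bar) for r, y in zip(res, ys)) / sum(
    (r - r_bar) ** 2 for r in res
)
c0 = y_bar - c1 * r_bar


def predict(a, x):
    """Predictor that depends on (a, x) only through the residual."""
    return c0 + c1 * residual(a, x)


def counterfactual_x(a, x, a_new):
    """Abduction-action-prediction under the fitted equation for X."""
    return residual(a, x) + group_mean[a_new]


x_cf = counterfactual_x(1, 4.2, 0)
print(abs(predict(1, 4.2) - predict(0, x_cf)) < 1e-9)  # True
```

The invariance holds only relative to the assumed additive-noise equation. If the true relationship between A and X is different, the residuals can still carry information about A, which is the misspecification risk described above.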
Kusner et al. demonstrated their framework using a dataset from the Law School Admission Council (LSAC), which contains records of 21,790 law students across 163 law schools in the United States.[1] The data was originally collected for the LSAC National Longitudinal Bar Passage Study (Wightman, 1998).[9]
The dataset includes the following key variables.
| Variable | Description | Role in the causal model |
|---|---|---|
| Race (R) | Student's race | Protected attribute |
| Sex (S) | Student's sex | Protected attribute |
| LSAT | Law School Admission Test score | Observed feature (descendant of A) |
| GPA | Undergraduate grade point average | Observed feature (descendant of A) |
| FYA | First-year law school average grade | Outcome variable (target) |
| Knowledge (K) | Latent variable representing a student's underlying knowledge and ability | Exogenous variable (not directly observed) |
The causal model posits that a student's latent knowledge K, together with their race and sex, causally determines their observable test scores (LSAT, GPA) and their law school performance (FYA). K is treated as a background variable with no parents in the graph, so it is independent of race and sex by assumption. Race and sex affect LSAT, GPA, and FYA through direct effects, which stand in for mechanisms such as differential access to educational resources, stereotype threat, or grading bias.
The structural equations in the model are:
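The equations themselves did not survive in this copy of the article. The block below is a reconstruction from Kusner et al. (2017) of the Level 2 (Fair K) model and the Level 3 (Fair Add) model, written with R for race, S for sex, and K for latent knowledge; the weight and bias symbols follow the paper's general notation as best recalled and should be checked against the original before being relied on.

```latex
% Level 2 (Fair K): latent knowledge K with no parents
\begin{align*}
\mathrm{GPA}  &\sim \mathcal{N}\!\left(b_G + w_G^K K + w_G^R R + w_G^S S,\ \sigma_G\right) \\
\mathrm{LSAT} &\sim \mathrm{Poisson}\!\left(\exp\!\left(b_L + w_L^K K + w_L^R R + w_L^S S\right)\right) \\
\mathrm{FYA}  &\sim \mathcal{N}\!\left(w_F^K K + w_F^R R + w_F^S S,\ 1\right) \\
K &\sim \mathcal{N}(0, 1)
\end{align*}

% Level 3 (Fair Add): deterministic equations with additive error terms
\begin{align*}
\mathrm{GPA}  &= b_G + w_G^R R + w_G^S S + \epsilon_G \\
\mathrm{LSAT} &= b_L + w_L^R R + w_L^S S + \epsilon_L \\
\mathrm{FYA}  &= b_F + w_F^R R + w_F^S S + \epsilon_F
\end{align*}
```

In the Level 2 model the predictor is built on the inferred K; in the Level 3 model it is built on the estimated error terms of GPA and LSAT.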
The key insight is that using raw LSAT and GPA scores to predict FYA is not counterfactually fair, because these scores are descendants of race and sex in the causal graph. They carry information about the causal effect of protected attributes on test performance.
Kusner et al. compared four approaches, reporting the root-mean-square error (RMSE) of a logistic-regression predictor on a held-out 20 percent test split.[1]
| Model | Description | RMSE | Counterfactually fair? |
|---|---|---|---|
| Full | Uses all variables including race and sex | 0.873 | No |
| Unaware | Uses LSAT and GPA but not race and sex directly | 0.894 | No (LSAT and GPA are descendants of A) |
| Fair K (Level 2) | Infers latent knowledge K and uses it for prediction | 0.929 | Yes |
| Fair Add (Level 3) | Uses additive error terms from structural equations | 0.918 | Yes |
The fair models achieve counterfactual fairness with a modest increase in prediction error. The "Unaware" model, which simply removes the protected attributes from the input, is not counterfactually fair because it still uses features (LSAT and GPA) that are causally influenced by race and sex. This illustrates why fairness through unawareness (simply removing sensitive attributes) is generally insufficient.[1] Because the authors judged that LSAT, GPA, and FYA are all biased by race and sex, none of the observed features qualify as non-descendants of the protected attributes, so a Level 1 predictor could not be built for this dataset and only the Level 2 (Fair K) and Level 3 (Fair Add) models were evaluated.[1]
Counterfactual fairness occupies a specific position in the broader landscape of fairness definitions. The following table compares it with other commonly studied criteria.
| Fairness definition | Type | Key requirement | Uses causal model? | Limitations |
|---|---|---|---|---|
| Demographic parity | Group | Equal positive prediction rates across groups | No | May require different thresholds for different groups; ignores differences in base rates [1] |
| Equalized odds / equality of opportunity | Group | Equal true positive and false positive rates across groups | No | Conditions on the true label, which may itself be biased [4] |
| Individual fairness | Individual | Similar individuals receive similar predictions | No (requires a similarity metric) | Defining the similarity metric is challenging and task-specific [3] |
| Fairness through unawareness | Individual | Protected attribute not used as an explicit input | No | Proxies in the remaining features can still carry the protected attribute's influence [1] |
| Counterfactual fairness | Individual (causal) | Prediction unchanged under counterfactual change of protected attribute | Yes | Requires specification and correctness of a causal model [1] |
| Calibration | Group | Among individuals assigned probability p, the fraction of positives is p, regardless of group | No | Can be satisfied by a biased model if base rates differ [15] |
| Path-specific counterfactual fairness | Individual (causal) | Only effects through "unfair" causal pathways are blocked | Yes | Requires distinguishing fair from unfair causal pathways [7] |
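The group-level criteria in the table can be computed directly from predictions and labels, which is part of what separates them from the causal criteria. The sketch below uses a small invented dataset and a hypothetical helper, `group_report`, to show a classifier that satisfies demographic parity while violating equalized odds. It assumes every group contains at least one positive and one negative label.

```python
def rate(values):
    """Fraction of ones in a non-empty list of 0/1 values."""
    return sum(values) / len(values)


def group_report(y_true, y_pred, group):
    """Per-group positive prediction rate, true positive rate, false positive rate."""
    report = {}
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        pos = [i for i in idx if y_true[i] == 1]
        neg = [i for i in idx if y_true[i] == 0]
        report[g] = {
            "positive_rate": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in pos]),
            "fpr": rate([y_pred[i] for i in neg]),
        }
    return report


# Invented labels and predictions for two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

report = group_report(y_true, y_pred, group)
print(report["a"])  # {'positive_rate': 0.5, 'tpr': 0.5, 'fpr': 0.5}
print(report["b"])  # {'positive_rate': 0.5, 'tpr': 1.0, 'fpr': 0.0}
```

No comparable calculation exists for counterfactual fairness or its path-specific variants without first committing to a structural causal model.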
In 2023, Lucas Rosenblatt and R. Teal Witter, both at New York University, published a paper titled "Counterfactual Fairness Is Basically Demographic Parity" at the AAAI Conference on Artificial Intelligence. They argued that any algorithm satisfying counterfactual fairness also satisfies demographic parity, and that any algorithm satisfying demographic parity can be trivially modified to satisfy counterfactual fairness. Their empirical analysis compared three existing counterfactual fairness algorithms against three simple benchmarks and found that two of the benchmark algorithms outperformed all three existing algorithms in terms of fairness, accuracy, and efficiency on several datasets. They also proposed preserving the order of individuals within protected groups as a concrete, transparent fairness goal.[5]
In 2024, Ricardo Silva (one of the original authors of the counterfactual fairness paper) published a rebuttal titled "Counterfactual Fairness Is Not Demographic Parity, and Other Observations." Silva's central argument frames the disagreement in terms of Pearl's ladder: counterfactual fairness is an individual-level notion on Rung 3, whereas demographic parity is a non-causal group notion on Rung 1, so a blanket equivalence between the two cannot hold. He concedes that there are special cases under strong causal assumptions where the two notions coincide, but calls this an "inconsequential equivalence that relies on narrow Rung 3 conditions." His key technical point is that because different causal models can be observationally and interventionally indistinguishable yet imply different sets of counterfactually fair predictors, for any predictor that satisfies demographic parity "an adversary can pick a particular world that explains the data but in which it is not counterfactually fair." Silva attributes the error in the equivalence proof to conflating a mathematical construction with a causal assumption.[6] The debate remains active and highlights the subtle relationship between causal and statistical notions of fairness.
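The point about indistinguishable causal models can be illustrated with a standard textbook-style construction, which is not taken from Silva's paper. Two hypothetical worlds generate a binary feature X from a protected attribute A and an independent fair-coin background variable U: in one, X = A XOR U; in the other, X = U. Because A and U are independent and there is no confounding, the observational and interventional distributions of X coincide within each world and across the two worlds, yet the predictor Y-hat = X is counterfactually fair in only one of them.

```python
def world_xor(a, u):
    """World 1: X depends on A for every individual."""
    return a ^ u


def world_copy(a, u):
    """World 2: X ignores A entirely."""
    return u


def rate_given_a(world):
    """P(X = 1 | A = a) with U a fair coin independent of A.

    With no confounding this also equals P(X = 1 | do(A = a)).
    """
    return {a: sum(world(a, u) for u in (0, 1)) / 2 for a in (0, 1)}


def counterfactually_invariant(world):
    """Does X stay the same for every individual u when A is flipped?"""
    return all(world(0, u) == world(1, u) for u in (0, 1))


print(rate_given_a(world_xor))   # {0: 0.5, 1: 0.5}
print(rate_given_a(world_copy))  # {0: 0.5, 1: 0.5}
print(counterfactually_invariant(world_xor))   # False
print(counterfactually_invariant(world_copy))  # True
```

The predictor Y-hat = X satisfies demographic parity in both worlds, and no amount of observational or experimental data on (A, X) could tell the worlds apart.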
A notable extension is path-specific counterfactual fairness, introduced by Silvia Chiappa (then at DeepMind) at AAAI 2019. Standard counterfactual fairness requires that the prediction be invariant to any change in the protected attribute. However, in some settings certain causal pathways from the protected attribute to the outcome may be considered acceptable, while others are not. Chiappa's method corrects only the observations adversely affected along unfair pathways and uses a variational-autoencoder-style model so that it can apply to complex nonlinear settings without discarding as much individual-specific information as restricting to non-descendants would.[7] The original counterfactual fairness paper itself anticipated this direction, noting that "it is desirable to define path-specific variations of counterfactual fairness that allow for the inclusion of some descendants of A."[1]
For example, consider a university admissions decision where the protected attribute is socioeconomic status. The effect of socioeconomic status on admission through educational quality (attending under-resourced schools) might be considered unfair, while the effect through genuine effort and talent development might be considered acceptable. Path-specific approaches allow practitioners to specify which causal pathways are "fair" and which are "unfair," and require invariance only along the unfair pathways.
This family of methods provides greater flexibility but also introduces additional complexity, since the practitioner must make normative judgments about which pathways constitute fair versus unfair influence. A closely related formulation by Razieh Nabi and Ilya Shpitser, "Fair Inference on Outcomes" (AAAI 2018), characterizes discrimination as the presence of an effect of the protected attribute on the outcome along impermissible causal pathways and learns a fair model by solving a constrained optimization problem that bounds those path-specific effects.[16]
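The path-specific idea can be sketched in a linear setting. The example below is a simplified illustration, not Chiappa's variational method or the Nabi and Shpitser optimization: it assumes two mediators with known linear effects of A, one on a path judged unfair and one on a path judged fair, and it corrects only the unfair mediator by replacing A's contribution with that of a baseline group. All coefficients and names are hypothetical.

```python
ALPHA_UNFAIR = 1.5  # assumed effect of A on the mediator along the unfair path
BETA_FAIR = 0.8     # assumed effect of A on the mediator along the fair path
BASELINE_A = 0      # reference group used when correcting the unfair path


def mediators(a, u_unfair, u_fair):
    """Structural equations for the two mediators of a single individual."""
    return ALPHA_UNFAIR * a + u_unfair, BETA_FAIR * a + u_fair


def correct_unfair_mediator(m_unfair, a):
    """Swap A's contribution along the unfair path for the baseline group's."""
    return m_unfair - ALPHA_UNFAIR * a + ALPHA_UNFAIR * BASELINE_A


def path_specific_prediction(a, m_unfair, m_fair):
    """Sum of the corrected unfair mediator and the untouched fair mediator."""
    return correct_unfair_mediator(m_unfair, a) + m_fair


# The same individual (same background terms) in the two worlds A = 1 and A = 0.
pred_a1 = path_specific_prediction(1, *mediators(1, u_unfair=1.0, u_fair=2.0))
pred_a0 = path_specific_prediction(0, *mediators(0, u_unfair=1.0, u_fair=2.0))

# Only the effect along the fair path remains in the difference.
print(abs((pred_a1 - pred_a0) - BETA_FAIR) < 1e-9)  # True
```

Standard counterfactual fairness would require the two predictions to be equal; the path-specific variant tolerates the difference that flows through the mediator deemed acceptable.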
In 2019, Yongkai Wu, Lu Zhang, Xintao Wu, and Hanghang Tong proposed path-specific counterfactual fairness in a unified form they call PC-fairness, presented at NeurIPS 2019. PC-fairness is a general notion parameterized by a profile of causal paths and a context, and the authors show that it subsumes many earlier causality-based fairness notions, including total causal fairness, direct and indirect discrimination, and the original counterfactual fairness, as special cases. Their main technical contribution addresses identifiability: when the path-specific counterfactual quantity cannot be uniquely determined from observational data, they formulate a constrained optimization problem to compute upper and lower bounds on the degree of unfairness.[17] Quantifying how observed disparities decompose along these pathways was also formalized by Junzhe Zhang and Elias Bareinboim in "Fairness in Decision-Making: The Causal Explanation Formula" (AAAI 2018), which separates a total observed disparity into counterfactual direct, indirect, and spurious components.[18]
Counterfactual fairness has been studied in several application domains where algorithmic decisions have significant consequences for individuals.
Recidivism prediction tools such as the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system have faced criticism for racial bias. ProPublica's 2016 investigation found that Black defendants were significantly more likely than white defendants to be incorrectly classified as high risk for reoffending.[12] Counterfactual fairness provides a framework for evaluating whether a risk assessment tool's prediction for a defendant would change if that defendant's race were counterfactually altered while their background circumstances remained the same. Applying counterfactual fairness to such systems requires a causal model of how race, socioeconomic factors, prior criminal history, and recidivism are related.
Fair lending laws in the United States (such as the Equal Credit Opportunity Act and the Fair Housing Act) prohibit discrimination in credit decisions on the basis of race, sex, religion, and other protected characteristics. Machine learning models used for credit scoring may inadvertently discriminate through proxy variables such as zip code, which correlate with race due to historical patterns of residential segregation. A counterfactually fair credit scoring model would produce the same credit decision for an applicant regardless of their race, after accounting for the causal relationships between race, socioeconomic factors, and creditworthiness indicators.
Automated resume screening and hiring recommendation systems may perpetuate historical biases present in training data. Counterfactual fairness in this context asks whether a hiring decision would remain the same if a candidate's gender (or other protected attribute) were changed while their skills, qualifications, and experience remained causally the same. This requires distinguishing between qualifications that are genuinely job-relevant and those that are artifacts of historical discrimination.
Clinical decision support systems that use patient data may produce different recommendations for patients of different racial or ethnic groups, even when clinical indicators are similar. Counterfactual fairness can be applied to ensure that treatment recommendations are based on clinical need rather than on features that are causally influenced by race or ethnicity. Obermeyer et al. (2019) documented a widely used healthcare algorithm that exhibited racial bias in identifying patients who need extra care, finding that the algorithm used health-care costs as a proxy for health needs even though less money is spent on Black patients with the same level of need, which underscores the importance of fairness-aware approaches in medical AI.[11]
Counterfactual fairness offers several properties that distinguish it from purely statistical fairness definitions.
Causal grounding. By requiring a causal model, counterfactual fairness forces practitioners to make their assumptions about the data-generating process explicit. This transparency can reveal hidden sources of bias that statistical approaches might miss.
Individual-level fairness. The definition operates at the level of individual predictions rather than group-level statistics. This means it can detect unfairness even when aggregate statistics appear balanced.
Handling proxy discrimination. Because counterfactual fairness traces causal pathways, it can identify cases where a model discriminates through proxy variables (such as zip code as a proxy for race), even when the protected attribute is not directly used as an input.
Principled treatment of descendants. The framework provides a clear criterion for which variables can be safely used in prediction (non-descendants of A) and which carry the causal influence of the protected attribute.
Despite its theoretical appeal, counterfactual fairness faces several practical challenges.
Causal model specification. The definition requires a structural causal model, but the true causal model is rarely known. Constructing a causal model requires domain expertise and involves assumptions that may be contested. Different stakeholders may disagree about the correct causal graph, and there is no purely data-driven way to resolve such disagreements. Kusner et al. acknowledge this directly, noting that structural equations are "in general unfalsifiable even if interventional data for all variables is available," because infinitely many structural equations can be compatible with the same observational or interventional distribution, so a counterfactual model should be treated as a provisional conjecture open to revision.[1]
Identifiability. Counterfactual quantities are not always identifiable from observational data alone. In some settings, the causal effect of the protected attribute on the outcome cannot be uniquely determined without additional assumptions (such as the absence of hidden confounders) or experimental data.
Model misspecification. If the assumed causal model is incorrect, a predictor that appears counterfactually fair under the assumed model may not be fair under the true causal model. Sensitivity analysis methods can help assess how robust conclusions are to model misspecification, but they do not eliminate the problem.
Scalability. Computing counterfactuals in complex causal models with many variables can be computationally expensive. For large-scale machine learning systems with hundreds or thousands of features, constructing and fitting a full structural causal model may be impractical.
Few non-descendant features. In many real-world problems, most observed features are descendants of protected attributes. Race and gender, for example, causally influence education, income, neighborhood, health outcomes, and many other commonly used features. At Level 1 (using only non-descendant features), this may leave very few usable features for prediction.
Ontological questions about protected attributes. The framework treats protected attributes as variables that can be "set" to different values in a counterfactual world. Some scholars have questioned whether it is coherent to ask what would have happened if a person had been a different race, given that race is deeply intertwined with life experience, identity, and social context. Atoosa Kasirzadeh and Andrew Smart (FAccT 2021) develop this objection in detail, reviewing the philosophy of social ontology and the semantics of counterfactuals and arguing that the counterfactual approach "can require an incoherent theory of what social categories such as race are," because such categories often "may not admit counterfactual manipulation." They build on Issa Kohler-Hausmann's argument that a manipulationist causal model misrepresents how social categories like race produce disparities, and that a constitutive account, rather than a counterfactual one, may be the appropriate formal model.[23][24] Defenders of the approach reply that, provided the structural causal model is correctly specified, generating the counterfactual automatically propagates the effect of changing the protected attribute onto every downstream feature, so the manipulation is not as naive as critics suggest. Kusner et al. themselves anticipated part of this debate, arguing that treating protected attributes as causes is more productive than declaring them outside the scope of causal analysis.[1]
Hidden confounders. Unobserved confounders that affect both the protected attribute and the outcome can lead to incorrect counterfactual estimates. If important variables are omitted from the causal model, the resulting fairness guarantees may be unreliable.
Several software packages and toolkits support the implementation of causal fairness methods, including counterfactual fairness.
| Tool | Language | Description |
|---|---|---|
| DoWhy | Python | A library for causal inference that includes support for counterfactual fairness evaluation, developed by Microsoft Research and now part of the PyWhy ecosystem |
| faircause | R | An R package implementing the causal fairness analysis framework of Plecko and Bareinboim (2024), including the Fairness Cookbook for decomposing observed disparities into direct, indirect, and spurious causal components[8] |
| AI Fairness 360 (AIF360) | Python, R | An extensible toolkit developed by IBM Research for detecting and mitigating algorithmic bias, including multiple fairness metrics and bias mitigation algorithms[10] |
| Fairlearn | Python | A Microsoft toolkit for assessing and improving fairness of machine learning models, supporting multiple fairness definitions and mitigation techniques |
Research on counterfactual fairness has expanded in several directions since the original 2017 paper.
Deep learning extensions. Researchers have developed methods for enforcing counterfactual fairness in deep neural networks. The Generative Counterfactual Fairness Network (GCFN), proposed by Yuchen Ma, Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel, uses a tailored generative adversarial network (GAN) to directly learn the counterfactual distribution of the descendants of the protected attribute and then enforces fair predictions through a counterfactual mediator regularization term. The authors prove that, if the counterfactual distribution is learned sufficiently well, the method is guaranteed to satisfy counterfactual fairness, which they describe as the first such theoretical guarantee for a neural prediction method of this kind.[19]
Graph neural networks. Counterfactual fairness has been extended to graph neural networks (GNNs), where the graph structure itself can encode discriminatory patterns. Methods such as contrastive learning on counterfactual graph augmentations aim to produce fair node representations that are invariant to changes in sensitive attributes.
Counterfactual fairness with imperfect models. Recognizing that perfect causal models are rarely available, researchers have developed methods for achieving approximate counterfactual fairness when the structural causal model is imperfect or partially specified. These methods aim to provide fairness guarantees that degrade gracefully with the degree of model misspecification.
Partial identification. When counterfactual quantities are not point-identified from observational data, partial identification approaches compute bounds on the degree of counterfactual unfairness. The PC-fairness framework of Wu et al. (2019) is a representative example, casting the bounding problem as a constrained optimization that returns the maximum and minimum possible unfairness consistent with the data.[17] This allows practitioners to assess fairness even when the causal model does not fully determine counterfactual outcomes.
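The idea of bounding an unidentified counterfactual quantity can be seen in a much simpler setting than PC-fairness. The sketch below is not the method of Wu et al.; it assumes a binary outcome and assumes that the two interventional marginals, `p_a0` = P(Y_{a0} = 1) and `p_a1` = P(Y_{a1} = 1), have already been identified. The joint distribution of the two potential outcomes is not identified, so the probability that an individual's outcome would flip, P(Y_{a0} ≠ Y_{a1}), can only be bounded. Because P(Y_{a0} ≠ Y_{a1}) = p_a0 + p_a1 - 2 P(Y_{a0} = 1, Y_{a1} = 1), the Fréchet bounds on the joint term give the interval computed here. The function name is hypothetical.

```python
def flip_probability_bounds(p_a0, p_a1):
    """Bounds on P(Y_{a0} != Y_{a1}) for a binary outcome, given only the marginals.

    Uses the Frechet bounds max(0, p_a0 + p_a1 - 1) <= P(both = 1) <= min(p_a0, p_a1).
    """
    if not (0.0 <= p_a0 <= 1.0 and 0.0 <= p_a1 <= 1.0):
        raise ValueError("marginals must be probabilities in [0, 1]")
    lower = abs(p_a0 - p_a1)                      # joint term at its maximum
    upper = min(p_a0 + p_a1, 2.0 - p_a0 - p_a1)   # joint term at its minimum
    return lower, upper

# Illustrative marginals (made-up numbers, not results from any study).
low, high = flip_probability_bounds(0.3, 0.5)
print(f"P(outcome flips) lies in [{low:.2f}, {high:.2f}]")
```

Even when the marginals are equal, so that the lower bound is zero, the upper bound can be far from zero. This is why equal interventional rates across groups do not by themselves establish individual-level counterfactual fairness.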
Counterfactual fairness with a partially known graph. A related line of work relaxes the requirement that the full causal graph be specified in advance, instead assuming only a partially known structure (for example a class of graphs consistent with background knowledge) and deriving the predictions that remain counterfactually fair across that class.[20]
Combining factual and counterfactual predictions. A 2024 NeurIPS paper explored combining factual predictions (based on the observed data) with counterfactual predictions (based on the individual's counterfactual features) to trade off predictive accuracy against counterfactual fairness.
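One way to see why combining the two kinds of prediction can help is the following sketch. It is an illustration under stated assumptions, not the cited paper's algorithm: it assumes a binary protected attribute, a hypothetical and fully known linear SCM `X = ALPHA * A + U`, an arbitrary stand-in base model, and an unweighted average of the factual and counterfactual predictions (the paper's weighting may differ). Because the average is symmetric in the factual and counterfactual versions of an individual, both versions receive the same score.

```python
ALPHA = 1.5  # assumed causal effect of A on X in the hypothetical SCM X = ALPHA * A + U

def counterfactual_x(x, a, a_cf):
    """Abduction-action-prediction in the assumed linear SCM."""
    u = x - ALPHA * a          # abduction: recover the individual's noise term
    return ALPHA * a_cf + u    # action and prediction: set A = a_cf, recompute X

def factual_predictor(x, a):
    """Stand-in for any accuracy-oriented model (coefficients are made up)."""
    return 0.8 * x + 0.3 * a

def combined_predictor(x, a):
    """Average of the prediction on the factual and on the counterfactual input."""
    a_cf = 1 - a
    x_cf = counterfactual_x(x, a, a_cf)
    return 0.5 * (factual_predictor(x, a) + factual_predictor(x_cf, a_cf))

# An individual observed with a = 1, and the same individual had a been 0.
x, a = 2.0, 1
x_cf, a_cf = counterfactual_x(x, a, 1 - a), 1 - a

print("factual model:  ", factual_predictor(x, a), factual_predictor(x_cf, a_cf))
print("combined model: ", combined_predictor(x, a), combined_predictor(x_cf, a_cf))
```

The base model gives the two versions of the individual different scores, while the combined model gives them the same one. The guarantee depends entirely on `counterfactual_x` being correct, which in practice means the structural causal model must be right.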
Comprehensive causal fairness frameworks. Drago Plecko and Elias Bareinboim (2024) published a comprehensive monograph on causal fairness analysis in Foundations and Trends in Machine Learning. Their framework introduces the Standard Fairness Model (a causal-diagram template that requires fewer modeling assumptions than a fully specified graph) and the Fairness Map, which links observed disparities to underlying causal mechanisms. It culminates in a practical procedure they call the Fairness Cookbook, accompanied by the faircause R package.[8] The breadth of causality-based fairness notions has prompted survey work as well: Makhlouf, Zhioua, and Palamidessi (2024) catalogue the main causal fairness definitions and rank them by where they sit on Pearl's ladder of causation as a guide to how difficult each is to deploy in practice.[21]
Robustness to the choice of causal world. Because the correct causal model is usually uncertain, Russell, Kusner, Loftus, and Silva, in "When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness" (NeurIPS 2017), show how to learn a predictor that is approximately counterfactually fair across several competing causal models at once, so that the decision is defensible regardless of which model turns out to be correct.[22]
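The multi-world idea can be sketched with a linear predictor and a penalty on the counterfactual gap under each candidate model. This is a simplified stand-in, not the authors' formulation: it assumes two hypothetical worlds that disagree only on how strongly the protected attribute shifts the feature `x1` (by 1.0 or by 2.0), a second feature `x2` that the attribute does not affect in either world, and a squared penalty with a made-up weight `lam`. For a linear predictor `w1 * x1 + w2 * x2`, flipping the attribute in world `k` changes the prediction by `w1 * e1 + w2 * e2`, where `(e1, e2)` is that world's effect vector, so the penalized least-squares problem has a closed-form solution.

```python
import random

# Candidate causal worlds: effect of flipping A on (x1, x2). Both are assumptions.
WORLDS = [(1.0, 0.0), (2.0, 0.0)]

def make_data(n=10000, seed=0):
    """Synthetic data generated from the first world (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = rng.randint(0, 1)
        x1 = 1.0 * a + rng.gauss(0.0, 1.0)   # descendant of A
        x2 = rng.gauss(0.0, 1.0)             # not a descendant of A in either world
        y = x1 + x2 + rng.gauss(0.0, 1.0)
        data.append((x1, x2, y))
    return data

def fit(data, lam):
    """Minimise mean squared error + lam * sum over worlds of (counterfactual gap)^2."""
    n = len(data)
    s11 = sum(x1 * x1 for x1, _, _ in data) / n
    s12 = sum(x1 * x2 for x1, x2, _ in data) / n
    s22 = sum(x2 * x2 for _, x2, _ in data) / n
    b1 = sum(x1 * y for x1, _, y in data) / n
    b2 = sum(x2 * y for _, x2, y in data) / n
    for e1, e2 in WORLDS:                    # add lam * e e^T for each world
        s11 += lam * e1 * e1
        s12 += lam * e1 * e2
        s22 += lam * e2 * e2
    det = s11 * s22 - s12 * s12              # solve the 2x2 system by Cramer's rule
    return (b1 * s22 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det

def gaps(w):
    """Absolute change in the prediction when A is flipped, in each world."""
    return [abs(w[0] * e1 + w[1] * e2) for e1, e2 in WORLDS]

data = make_data()
w_plain = fit(data, lam=0.0)
w_robust = fit(data, lam=10.0)
print("unpenalised weights:", w_plain, "gaps per world:", gaps(w_plain))
print("penalised weights:  ", w_robust, "gaps per world:", gaps(w_robust))
```

The penalized fit shrinks the weight on `x1` until the counterfactual gap is small under both worlds, while keeping the weight on `x2`, which neither world treats as affected by the protected attribute. Raising `lam` tightens the gap at the cost of predictive accuracy.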