Counterfactual fairness is a formal definition of algorithmic fairness rooted in causal inference. Introduced by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva in their 2017 paper presented at the Conference on Neural Information Processing Systems (NeurIPS), the concept defines a prediction as fair toward an individual if the prediction would remain the same in a counterfactual world where that individual belonged to a different demographic group. Unlike statistical fairness criteria such as demographic parity or equalized odds, counterfactual fairness explicitly models the causal mechanisms through which protected attributes influence outcomes, drawing on Judea Pearl's framework of structural causal models (SCMs).
The framework has become one of the most widely studied approaches to individual-level fairness in machine learning, with applications in hiring, lending, criminal justice, and healthcare. It has also sparked active debate about the relationship between causal and statistical fairness definitions.
Imagine a teacher is picking students for the school spelling bee. The teacher should choose students based on how well they can spell, not based on whether they are a boy or a girl. Counterfactual fairness asks a simple question: if we could magically change one thing about a student (like whether they are a boy or a girl) but keep everything else the same (how much they practiced, how many words they know), would the teacher still make the same choice? If the answer is yes, the decision is fair. If the answer is no, then the decision is being influenced by something it should not depend on.
Machine learning systems are increasingly used to make high-stakes decisions that affect people's lives, including loan approvals, hiring decisions, bail and sentencing recommendations, and medical diagnoses. These systems learn from historical data that often reflects decades or centuries of systemic discrimination. A hiring model trained on past hiring decisions, for example, may learn to penalize applicants from underrepresented groups simply because those groups were historically hired at lower rates, regardless of their actual qualifications.
Early approaches to algorithmic fairness focused on statistical definitions. Demographic parity requires that the proportion of positive outcomes be equal across groups. Equalized odds, introduced by Hardt, Price, and Srebro in 2016, requires that a classifier have equal true positive and false positive rates across groups. Individual fairness, proposed by Dwork et al. in 2012, requires that similar individuals receive similar predictions, but leaves the definition of "similar" as a task-specific design choice.
These definitions, while useful, do not capture the causal pathways through which protected attributes influence predictions. A model that achieves demographic parity might still rely on proxies for race or gender. Equalized odds conditions on the true outcome, which may itself be tainted by historical bias. Individual fairness requires a similarity metric that can be difficult to specify in practice.
Kusner et al. argued that fairness is inherently a causal concept. To determine whether a decision is fair, one must ask: would this decision have been different if the individual had belonged to a different group? Answering this question requires reasoning about counterfactuals, which in turn requires a causal model of how variables relate to one another.
Counterfactual fairness is defined within the framework of structural causal models (SCMs), a formalism developed by Judea Pearl. An SCM is a mathematical object that represents the causal relationships among a set of variables. Formally, an SCM is a triple (U, V, F) consisting of three components.
Exogenous variables (U) are variables whose values are determined by factors outside the model. They represent unobserved background conditions, individual characteristics, or sources of randomness. In the fairness setting, exogenous variables often represent innate traits or abilities that are independent of a person's membership in a protected group.
Endogenous variables (V) are variables whose values are determined by other variables in the model through structural equations. In the fairness context, endogenous variables include the protected attribute A (such as race or gender), observed features X (such as test scores or work experience), and the outcome Y (such as whether someone is hired).
Structural equations (F) define how each endogenous variable is determined by its parents in the causal graph and by relevant exogenous variables. Each equation takes the form V_i = f_i(pa_i, U_pa_i), where pa_i denotes the direct causes (parents) of V_i in the causal graph.
The structural equations induce a directed acyclic graph (DAG) that represents the causal relationships among variables. Arrows in the graph point from causes to effects. The DAG makes explicit which variables are causally affected by the protected attribute, either directly or through intermediate variables.
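For illustration, the following sketch encodes a small hypothetical SCM in Python. The variable names, coefficients, and graph are assumptions chosen purely for this example, not part of any standard model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    """Sample n individuals from a toy SCM with protected attribute A,
    feature X, and outcome Y. U_A, U_X, U_Y are the exogenous variables."""
    u_a = rng.uniform(size=n)
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)

    a = (u_a < 0.5).astype(float)   # A := f_A(U_A), binary protected attribute
    x = 1.0 + 2.0 * a + u_x         # X := f_X(A, U_X), feature influenced by A
    y = 0.5 * x - 1.0 * a + u_y     # Y := f_Y(X, A, U_Y), outcome
    return a, x, y

a, x, y = sample_scm(1000)
```

In this toy model the induced graph is A -> X, A -> Y, and X -> Y, so both X and Y are descendants of A.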
A central feature of SCMs is their ability to answer counterfactual questions. Counterfactual reasoning in the Pearl framework follows a three-step procedure known as abduction-action-prediction.
| Step | Name | Description |
|---|---|---|
| 1 | Abduction | Given the observed evidence (the actual values of all variables for an individual), update the probability distribution over the exogenous variables U. This step infers what the individual's background characteristics must be, given what we observe about them. |
| 2 | Action | Modify the structural equations by intervening on the variable of interest. In the fairness setting, this means setting the protected attribute A to a different value (for example, changing race from one group to another). |
| 3 | Prediction | Using the modified model with the updated exogenous variables, compute the values of all other variables under the intervention. This produces the counterfactual outcome: what would have happened to this specific individual if their protected attribute had been different. |
This three-step procedure distinguishes counterfactual queries from interventional queries. An intervention asks what would happen to a randomly selected individual if we changed their group membership. A counterfactual asks what would have happened to a specific observed individual, taking into account everything we know about them.
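Continuing the toy SCM above (all numbers remain hypothetical), the three steps can be carried out directly because the structural equations are invertible in their noise terms: abduction solves for U_X and U_Y, action overwrites A, and prediction re-runs the equations.

```python
def counterfactual(a_obs, x_obs, y_obs, a_cf):
    """Abduction-action-prediction for the toy SCM sketched above."""
    # 1. Abduction: recover the exogenous noise consistent with the observation.
    u_x = x_obs - (1.0 + 2.0 * a_obs)
    u_y = y_obs - (0.5 * x_obs - 1.0 * a_obs)

    # 2. Action: set the protected attribute to the counterfactual value.
    a = a_cf

    # 3. Prediction: propagate the intervention using the recovered noise.
    x_cf = 1.0 + 2.0 * a + u_x
    y_cf = 0.5 * x_cf - 1.0 * a + u_y
    return x_cf, y_cf

# What would this individual's X and Y have been, had A been 1 instead of 0?
x_cf, y_cf = counterfactual(a_obs=0.0, x_obs=1.3, y_obs=0.2, a_cf=1.0)
```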
The formal definition of counterfactual fairness, as stated by Kusner et al. (2017), is as follows.
A predictor Y-hat is counterfactually fair if, for any individual with observed features X = x and protected attribute A = a:
P(Y-hat_{A <- a}(U) = y | X = x, A = a) = P(Y-hat_{A <- a'}(U) = y | X = x, A = a)
for all y and for any value a' that A could take.
In plain language, this definition states that the distribution of predictions for an individual must be identical whether we consider the actual world (where A = a) or a counterfactual world (where A = a') while holding fixed the individual's background characteristics U. The conditioning on X = x and A = a reflects our knowledge of the individual from the observed data. The subscript A <- a' denotes the counterfactual operation of setting A to a'.
The definition operates at the individual level, not the group level. It does not require that aggregate statistics (such as acceptance rates) be equal across groups. Instead, it requires that each individual person would receive the same prediction regardless of which group they belong to, after accounting for the causal structure of the data.
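Continuing the same toy example, the following sketch checks the definition for a single individual using the counterfactual helper defined above. Because the toy structural equations are invertible, the posterior over U given X = x and A = a is a point mass, so each side of the definition reduces to a single prediction; the predictor shown is hypothetical.

```python
def unaware_predictor(x):
    """A hypothetical 'unaware' predictor that uses only X, not A."""
    return 0.6 * x

def counterfactually_fair_for(predictor, a_obs, x_obs, y_obs, a_values, tol=1e-9):
    """Check the definition for one individual under the toy SCM above."""
    yhat_factual = predictor(x_obs)
    for a_cf in a_values:
        x_cf, _ = counterfactual(a_obs, x_obs, y_obs, a_cf)
        if abs(predictor(x_cf) - yhat_factual) > tol:
            return False
    return True

# False: X is a descendant of A, so even an "unaware" predictor fails.
print(counterfactually_fair_for(unaware_predictor, 0.0, 1.3, 0.2, [0.0, 1.0]))
```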
Kusner et al. proved a result that simplifies the implementation of counterfactual fairness in practice.
Lemma 1: A predictor Y-hat is counterfactually fair if it is a function only of the non-descendants of the protected attribute A in the causal graph.
This lemma provides a practical criterion: if the predictor uses only variables that are not causally affected by the protected attribute (either directly or indirectly), then it is automatically counterfactually fair. Variables that are descendants of A in the causal graph carry information that is causally influenced by group membership. Using such variables in a predictor can transmit the effect of the protected attribute into the prediction, even if the protected attribute is not used directly.
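Lemma 1 can be applied mechanically once a causal graph has been agreed on. The sketch below uses networkx to list the features that are not descendants of A; the graph and variable names are hypothetical.

```python
import networkx as nx

# Hypothetical causal graph; edges point from cause to effect.
graph = nx.DiGraph([
    ("A", "X1"), ("A", "X2"),    # X1 and X2 are causally affected by A
    ("U", "X3"),                 # X3 depends only on background factors
    ("X1", "Y"), ("X2", "Y"), ("X3", "Y"),
])

features = ["X1", "X2", "X3"]
descendants_of_a = nx.descendants(graph, "A")

# By Lemma 1, a predictor restricted to these features is counterfactually fair.
safe_features = [f for f in features if f not in descendants_of_a]
print(safe_features)  # ['X3']
```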
Kusner et al. described three levels at which counterfactual fairness can be implemented, with each level requiring progressively stronger causal assumptions but also allowing the use of more information for prediction.
| Level | Approach | Causal assumptions | Information used | Practical notes |
|---|---|---|---|---|
| Level 1 | Use only observable non-descendants of A | Minimal: only requires knowledge of which observed variables are not descendants of A | Only observed variables that are causally unaffected by A | Simple to implement, but in many real-world problems most observed variables are descendants of protected attributes, leaving few usable features |
| Level 2 | Infer latent variables U from observed data using domain knowledge | Moderate: requires specifying a causal model with latent variables and learning conditional distributions P(U given X, A) | Latent "fair" variables inferred from the data, which capture individual characteristics independent of group membership | Requires explicit domain knowledge about the causal structure; latent variables act as debiased versions of observed features |
| Level 3 | Specify fully deterministic structural equations with additive error terms | Strong: requires a complete specification of structural equations, typically as additive noise models V_i = f_i(pa_i) + e_i | Error terms (residuals) that are independent of A by construction, capturing individual-specific variation not attributable to group membership | Maximizes the amount of information available for prediction; the error terms serve as counterfactually fair inputs to the predictor |
Level 3 extracts the most information from the data while maintaining counterfactual fairness, but it depends on the correctness of the assumed structural equations. If the causal model is misspecified, the resulting predictor may not be truly fair.
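As an illustration of the Level 3 idea, the sketch below assumes each observed feature follows a linear additive-noise model in the protected attribute, estimates the structural equations by regression, and keeps the residuals as fair inputs. The function name and the use of ordinary least squares are choices made for this example, not the estimation procedure of the original paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def level3_residual_features(A, X):
    """Level 3 sketch: assume each feature X[:, j] = f_j(A) + e_j with noise
    independent of A, estimate f_j by regressing on A, and return the
    residuals e_j, which serve as counterfactually fair inputs."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X, dtype=float)
    residuals = np.empty_like(X)
    for j in range(X.shape[1]):
        f_j = LinearRegression().fit(A, X[:, j])
        residuals[:, j] = X[:, j] - f_j.predict(A)
    return residuals

# Hypothetical usage: A is a 2D (one-hot) encoding of the protected attribute,
# X holds the observed features, y is the training target.
# fair_inputs = level3_residual_features(A, X)
# fair_predictor = LinearRegression().fit(fair_inputs, y)
```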
Kusner et al. demonstrated their framework using a dataset from the Law School Admission Council (LSAC), which contains records of 21,790 law students across 163 law schools in the United States. The data was originally collected for the LSAC National Longitudinal Bar Passage Study (Wightman, 1998).
The dataset includes the following key variables.
| Variable | Description | Role in the causal model |
|---|---|---|
| Race (R) | Student's race | Protected attribute |
| Sex (S) | Student's sex | Protected attribute |
| LSAT | Law School Admission Test score | Observed feature (descendant of A) |
| GPA | Undergraduate grade point average | Observed feature (descendant of A) |
| FYA | First-year law school average grade | Outcome variable (target) |
| Knowledge (K) | Latent variable representing a student's underlying knowledge and ability | Exogenous variable (not directly observed) |
The causal model posits that a student's latent knowledge K, together with their race and sex, causally determines their observable test scores (LSAT, GPA) and their law school performance (FYA). Race and sex affect LSAT, GPA, and FYA both through their influence on K (for example, through differential access to educational resources) and through direct effects (for example, through stereotype threat or grading bias).
In the model, each structural equation expresses one of the observed variables (GPA, LSAT, and FYA) as a function of the latent knowledge K, race, and sex, together with an independent noise term.
The key insight is that using raw LSAT and GPA scores to predict FYA is not counterfactually fair, because these scores are descendants of race and sex in the causal graph. They carry information about the causal effect of protected attributes on test performance.
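As a rough sketch of how a "Fair Add"-style model could be set up on data of this shape, the code below reuses the level3_residual_features helper from the levels section above. The column names and pandas preprocessing are assumptions, and the simple least-squares fit stands in for the paper's actual estimation procedure.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical LSAC-style data frame with columns: race, sex, LSAT, GPA, FYA.
# df = pd.read_csv("law_school.csv")

def fair_add_fya_model(df):
    """Sketch of a 'Fair Add'-style predictor of first-year average (FYA)."""
    # Encode the protected attributes and extract their descendants.
    A = pd.get_dummies(df[["race", "sex"]], drop_first=True).to_numpy(dtype=float)
    X = df[["LSAT", "GPA"]].to_numpy(dtype=float)
    # Residuals of LSAT and GPA after removing the part explained by race and
    # sex (see level3_residual_features above) are the fair inputs.
    fair_inputs = level3_residual_features(A, X)
    return LinearRegression().fit(fair_inputs, df["FYA"])
```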
Kusner et al. compared four approaches.
| Model | Description | RMSE | Counterfactually fair? |
|---|---|---|---|
| Full | Uses all variables including race and sex | 0.873 | No |
| Unaware | Uses LSAT and GPA but not race and sex directly | 0.894 | No (LSAT and GPA are descendants of A) |
| Fair K (Level 2) | Infers latent knowledge K and uses it for prediction | 0.929 | Yes |
| Fair Add (Level 3) | Uses additive error terms from structural equations | 0.918 | Yes |
The fair models achieve counterfactual fairness with a modest increase in prediction error. The "Unaware" model, which simply removes the protected attributes from the input, is not counterfactually fair because it still uses features (LSAT and GPA) that are causally influenced by race and sex. This illustrates why fairness through unawareness (simply removing sensitive attributes) is generally insufficient.
Counterfactual fairness occupies a specific position in the broader landscape of fairness definitions. The following table compares it with other commonly studied criteria.
| Fairness definition | Type | Key requirement | Uses causal model? | Limitations |
|---|---|---|---|---|
| Demographic parity | Group | Equal positive prediction rates across groups | No | May require different thresholds for different groups; ignores differences in base rates |
| Equalized odds | Group | Equal true positive and false positive rates across groups | No | Conditions on the true label, which may itself be biased |
| Individual fairness | Individual | Similar individuals receive similar predictions | No (requires a similarity metric) | Defining the similarity metric is challenging and task-specific |
| Counterfactual fairness | Individual (causal) | Prediction unchanged under counterfactual change of protected attribute | Yes | Requires specification and correctness of a causal model |
| Calibration | Group | Among individuals assigned probability p, the fraction of positives is p, regardless of group | No | Can be satisfied by a biased model if base rates differ |
| Path-specific counterfactual fairness | Individual (causal) | Only effects through "unfair" causal pathways are blocked | Yes | Requires distinguishing fair from unfair causal pathways |
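For contrast with the causal definition, the two group criteria in the table can be computed directly from predictions and group labels; the sketch below assumes binary 0/1 labels, predictions, and group membership stored in NumPy arrays.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest gap in true-positive or false-positive rates across the groups."""
    gaps = []
    for label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        rates = [y_pred[(group == g) & (y_true == label)].mean() for g in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)
```

Counterfactual fairness, by contrast, cannot be evaluated from predictions, labels, and group membership alone: it additionally requires the causal model used in the sketches above.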
In 2023, Rosenblatt and Witter published a paper titled "Counterfactual Fairness Is Basically Demographic Parity" at the AAAI Conference on Artificial Intelligence. They argued that any algorithm satisfying counterfactual fairness also satisfies demographic parity, and that any algorithm satisfying demographic parity can be trivially modified to satisfy counterfactual fairness. Their empirical analysis found that simple benchmark algorithms outperformed existing counterfactual fairness algorithms in terms of fairness, accuracy, and efficiency on several datasets.
In 2024, Ricardo Silva (one of the original authors of the counterfactual fairness paper) published a rebuttal titled "Counterfactual Fairness Is Not Demographic Parity, and Other Observations." Silva argued that the equivalence claim does not hold under careful examination and cautioned against making blanket statements of equivalence between causal concepts and purely probabilistic concepts. The debate remains active and highlights the subtle relationship between causal and statistical notions of fairness.
A notable extension of counterfactual fairness is path-specific counterfactual fairness (PC fairness), introduced by Chiappa in 2019. Standard counterfactual fairness requires that the prediction be invariant to any change in the protected attribute. However, in some settings certain causal pathways from the protected attribute to the outcome may be considered acceptable, while others are not.
For example, consider a university admissions decision where the protected attribute is socioeconomic status. The effect of socioeconomic status on admission through educational quality (attending under-resourced schools) might be considered unfair, while the effect through genuine effort and talent development might be considered acceptable. Path-specific counterfactual fairness allows practitioners to specify which causal pathways are "fair" and which are "unfair," and requires invariance only along the unfair pathways.
This extension provides greater flexibility but also introduces additional complexity, since the practitioner must make normative judgments about which pathways constitute fair versus unfair influence. It subsumes several other causal fairness notions as special cases.
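To make the pathway distinction concrete, the following sketch uses a hypothetical linear SCM (not taken from Chiappa's paper) in which the path through a mediator M is deemed fair and the direct edge from A to the outcome is deemed unfair; the path-specific counterfactual changes A only along the direct edge.

```python
# Hypothetical linear SCM:
#   M := 0.8 * A + U_M             (A -> M -> Y, deemed fair, e.g. effort)
#   Y := 1.5 * M - 0.7 * A + U_Y   (direct edge A -> Y, deemed unfair)

def path_specific_counterfactual_y(a_obs, m_obs, y_obs, a_cf):
    """Outcome had A been a_cf along the direct (unfair) edge only.

    The mediator keeps its factual value m_obs, so the fair pathway
    A -> M -> Y is untouched; only the unfair direct effect is switched."""
    u_y = y_obs - (1.5 * m_obs - 0.7 * a_obs)   # abduction for U_Y
    return 1.5 * m_obs - 0.7 * a_cf + u_y       # action + prediction
```

A predictor is path-specifically fair with respect to the direct edge if its output is unchanged under this operation.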
Counterfactual fairness has been studied in several application domains where algorithmic decisions have significant consequences for individuals.
Recidivism prediction tools such as the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system have faced criticism for racial bias. ProPublica's 2016 investigation found that Black defendants were significantly more likely than white defendants to be incorrectly classified as high risk for reoffending. Counterfactual fairness provides a framework for evaluating whether a risk assessment tool's prediction for a defendant would change if that defendant's race were counterfactually altered while their background circumstances remained the same. Applying counterfactual fairness to such systems requires a causal model of how race, socioeconomic factors, prior criminal history, and recidivism are related.
Fair lending laws in the United States (such as the Equal Credit Opportunity Act and the Fair Housing Act) prohibit discrimination in credit decisions on the basis of race, sex, religion, and other protected characteristics. Machine learning models used for credit scoring may inadvertently discriminate through proxy variables such as zip code, which correlate with race due to historical patterns of residential segregation. A counterfactually fair credit scoring model would produce the same credit decision for an applicant regardless of their race, after accounting for the causal relationships between race, socioeconomic factors, and creditworthiness indicators.
Automated resume screening and hiring recommendation systems may perpetuate historical biases present in training data. Counterfactual fairness in this context asks whether a hiring decision would remain the same if a candidate's gender (or other protected attribute) were changed while their skills, qualifications, and experience remained causally the same. This requires distinguishing between qualifications that are genuinely job-relevant and those that are artifacts of historical discrimination.
Clinical decision support systems that use patient data may produce different recommendations for patients of different racial or ethnic groups, even when clinical indicators are similar. Counterfactual fairness can be applied to ensure that treatment recommendations are based on clinical need rather than on features that are causally influenced by race or ethnicity. Obermeyer et al. (2019) documented a widely used healthcare algorithm that exhibited racial bias in identifying patients who need extra care, underscoring the importance of fairness-aware approaches in medical AI.
Counterfactual fairness offers several properties that distinguish it from purely statistical fairness definitions.
Causal grounding. By requiring a causal model, counterfactual fairness forces practitioners to make their assumptions about the data-generating process explicit. This transparency can reveal hidden sources of bias that statistical approaches might miss.
Individual-level fairness. The definition operates at the level of individual predictions rather than group-level statistics. This means it can detect unfairness even when aggregate statistics appear balanced.
Handling proxy discrimination. Because counterfactual fairness traces causal pathways, it can identify cases where a model discriminates through proxy variables (such as zip code as a proxy for race), even when the protected attribute is not directly used as an input.
Principled treatment of descendants. The framework provides a clear criterion for which variables can be safely used in prediction (non-descendants of A) and which carry the causal influence of the protected attribute.
Despite its theoretical appeal, counterfactual fairness faces several practical challenges.
Causal model specification. The definition requires a structural causal model, but the true causal model is rarely known. Constructing a causal model requires domain expertise and involves assumptions that may be contested. Different stakeholders may disagree about the correct causal graph, and there is no purely data-driven way to resolve such disagreements.
Identifiability. Counterfactual quantities are not always identifiable from observational data alone. In some settings, the causal effect of the protected attribute on the outcome cannot be uniquely determined without additional assumptions (such as the absence of hidden confounders) or experimental data.
Model misspecification. If the assumed causal model is incorrect, a predictor that appears counterfactually fair under the assumed model may not be fair under the true causal model. Sensitivity analysis methods can help assess how robust conclusions are to model misspecification, but they do not eliminate the problem.
Scalability. Computing counterfactuals in complex causal models with many variables can be computationally expensive. For large-scale machine learning systems with hundreds or thousands of features, constructing and fitting a full structural causal model may be impractical.
Few non-descendant features. In many real-world problems, most observed features are descendants of protected attributes. Race and gender, for example, causally influence education, income, neighborhood, health outcomes, and many other commonly used features. At Level 1 (using only non-descendant features), this may leave very few usable features for prediction.
Ontological questions about protected attributes. The framework treats protected attributes as variables that can be "set" to different values in a counterfactual world. Some scholars have questioned whether it is coherent to ask what would have happened if a person had been a different race, given that race is deeply intertwined with life experience, identity, and social context.
Hidden confounders. Unobserved confounders that affect both the protected attribute and the outcome can lead to incorrect counterfactual estimates. If important variables are omitted from the causal model, the resulting fairness guarantees may be unreliable.
Several software packages and toolkits support the implementation of causal fairness methods, including counterfactual fairness.
| Tool | Language | Description |
|---|---|---|
| DoWhy | Python | A library for causal inference that includes support for counterfactual fairness evaluation, developed by Microsoft Research and now part of the PyWhy ecosystem |
| faircause | R | An R package implementing the causal fairness analysis framework of Plecko and Bareinboim (2024), including methods for decomposing observed disparities into causal components |
| AI Fairness 360 (AIF360) | Python, R | An extensible toolkit developed by IBM Research for detecting and mitigating algorithmic bias, including multiple fairness metrics and bias mitigation algorithms |
| Fairlearn | Python | A Microsoft toolkit for assessing and improving fairness of machine learning models, supporting multiple fairness definitions and mitigation techniques |
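As an example of tooling support, the sketch below outlines how counterfactuals might be computed with DoWhy's graphical causal model (gcm) module; the graph, column names, and data file are hypothetical, and the exact API may differ between DoWhy versions.

```python
import networkx as nx
from dowhy import gcm

# Hypothetical causal graph: A is the protected attribute, X a feature, Y the outcome.
causal_graph = nx.DiGraph([("A", "X"), ("A", "Y"), ("X", "Y")])
scm = gcm.InvertibleStructuralCausalModel(causal_graph)

# With a pandas DataFrame `data` holding columns A, X, Y (hypothetical file):
# data = pd.read_csv("training_data.csv")
# gcm.auto.assign_causal_mechanisms(scm, data)   # pick mechanisms automatically
# gcm.fit(scm, data)
#
# Counterfactual for one observed individual: what would X and Y have been
# had A taken the value 1? (abduction-action-prediction under the hood)
# individual = data.iloc[[0]]
# cf = gcm.counterfactual_samples(scm, {"A": lambda a: 1}, observed_data=individual)
```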
Research on counterfactual fairness has expanded in several directions since the original 2017 paper.
Deep learning extensions. Researchers have developed methods for enforcing counterfactual fairness in deep neural networks. The Generative Counterfactual Fairness Network (GCFN) uses generative adversarial networks (GANs) to learn the counterfactual distribution of descendants of the protected attribute and enforces fair predictions through a counterfactual mediator regularization term.
Graph neural networks. Counterfactual fairness has been extended to graph neural networks (GNNs), where the graph structure itself can encode discriminatory patterns. Methods such as contrastive learning on counterfactual graph augmentations aim to produce fair node representations that are invariant to changes in sensitive attributes.
Counterfactual fairness with imperfect models. Recognizing that perfect causal models are rarely available, researchers have developed methods for achieving approximate counterfactual fairness when the structural causal model is imperfect or partially specified. These methods aim to provide fairness guarantees that degrade gracefully with the degree of model misspecification.
Partial identification. When counterfactual quantities are not point-identified from observational data, partial identification approaches compute bounds on the degree of counterfactual unfairness. This allows practitioners to assess fairness even when the causal model does not fully determine counterfactual outcomes.
Combining factual and counterfactual predictions. A 2024 NeurIPS paper explored methods for combining factual predictions (based on observed data) with counterfactual predictions (based on counterfactual reasoning) to achieve a balance between predictive accuracy and counterfactual fairness.
Comprehensive causal fairness frameworks. Plecko and Bareinboim (2024) published a comprehensive monograph on causal fairness analysis in Foundations and Trends in Machine Learning. Their framework introduces the Fairness Map, which links observed disparities to underlying causal mechanisms and provides a unified toolkit for causal fairness analysis.