Causal inference
Last reviewed
Apr 30, 2026
Sources
34 citations
Review status
Source-backed
Revision
v1 · 4,787 words
Causal inference is the field of study concerned with drawing conclusions about cause-and-effect relationships from data. It develops the assumptions, models, estimands, and estimators needed to answer questions of the form "what would happen to outcome Y if we intervened on variable X?" rather than the weaker statistical question "how does Y vary with X across observed cases?" The central technical task is to bridge the gap between an associational quantity, which can be read off any joint distribution, and an interventional or counterfactual quantity, which generally cannot.
The field combines work from statistics, economics, epidemiology, computer science, and philosophy. Its modern form owes most to two mathematical traditions that grew up in parallel during the late twentieth century: the potential outcomes framework associated with Jerzy Neyman and Donald Rubin, and the structural causal model framework associated with Judea Pearl. Both traditions are now used routinely in randomised trials, observational studies, policy evaluation, algorithmic fairness, and increasingly in machine learning.
Most predictive regression and classification systems are trained to minimise loss on a fixed joint distribution between inputs and outputs. They learn statistical associations, not mechanisms. This is fine when the deployment distribution matches the training distribution and when nobody intervenes on the inputs, but it breaks in three situations that come up constantly in practice.
The first is distribution shift. A model that learned that snow co-occurs with husky photographs will misclassify a husky on grass, because the spurious feature it relied on is no longer present. Causal models, which separate the data-generating mechanism from the distribution it happens to produce, are designed to be more robust to such shifts. Schölkopf and colleagues argue in their 2021 review "Toward Causal Representation Learning" that this is one of the strongest motivations for taking causality seriously in modern machine learning.
The second is intervention. If a credit scoring model is used to decide who gets a loan, and the bank changes its policy on the basis of model predictions, the joint distribution between features and repayment changes. Predictive accuracy on historical data tells you almost nothing about the effect of the new policy. You need a model of what changes when you intervene.
The third is fairness reasoning. Many definitions of fairness are inherently counterfactual: would the same applicant have received the same decision if their race had been different, holding everything causally upstream of race fixed? See counterfactual fairness and individual fairness for two closely related approaches.
Causal inference is dominated by two formalisms that look superficially different but turn out to express the same content.
The potential outcomes framework, also called the Neyman-Rubin causal model, was first introduced by Jerzy Neyman in his 1923 master's thesis on randomised experiments and generalised by Donald Rubin in 1974 to observational studies. For each unit i and each treatment level t, the model posits a potential outcome Y_i(t) that would have been observed if unit i had received treatment t. For binary treatment we write Y_i(1) and Y_i(0).
The individual treatment effect is the contrast Y_i(1) − Y_i(0). The fundamental problem of causal inference, in Holland's well-known phrase, is that for any given unit we observe only one of the two potential outcomes. The other is a missing counterfactual.
Identification of population quantities therefore requires extra assumptions. The most commonly used population estimand is the average treatment effect ATE = E[Y(1) − Y(0)]. Variants include the average treatment effect on the treated (ATT), the average treatment effect on the controls (ATC), and the conditional average treatment effect CATE(x) = E[Y(1) − Y(0) | X = x], which captures heterogeneity across covariates.
Under the assumptions of stable unit treatment values (SUTVA), unconfoundedness (also called conditional ignorability or strong ignorability), and positivity, the ATE is identified from observational data by adjusting for covariates. Imbens and Rubin's 2015 textbook "Causal Inference for Statistics, Social, and Biomedical Sciences" is the standard reference and won the 2016 PROSE Award for Textbook in the Social Sciences.
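The identification argument behind that statement is short enough to write out. The following is a sketch for a binary treatment, where the symbol T for the treatment indicator is notation assumed here and X denotes the adjustment covariates:

```latex
\begin{aligned}
\mathbb{E}[Y(t)]
  &= \mathbb{E}_X\big[\,\mathbb{E}[Y(t) \mid X]\,\big]
     && \text{(iterated expectations)} \\
  &= \mathbb{E}_X\big[\,\mathbb{E}[Y(t) \mid T = t, X]\,\big]
     && \text{(unconfoundedness; positivity keeps the conditioning well defined)} \\
  &= \mathbb{E}_X\big[\,\mathbb{E}[Y \mid T = t, X]\,\big]
     && \text{(consistency, implied by SUTVA)}
\end{aligned}
\qquad\Longrightarrow\qquad
\mathrm{ATE} = \mathbb{E}_X\big[\,\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\,\big].
```

Every term on the right-hand side involves only observed variables, which is what "identified" means here.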
The structural causal model (SCM) approach, developed largely by Judea Pearl in the 1990s and consolidated in his 2009 textbook "Causality: Models, Reasoning, and Inference" (second edition, Cambridge University Press), uses directed acyclic graphs (DAGs) over variables together with structural equations. Each node represents a variable; each directed edge represents a direct causal influence; each variable is a deterministic function of its parents and an exogenous noise term.
Intervention is represented by the do-operator. The expression do(X = x) denotes the operation of setting X to value x by external action, severing the incoming edges to X in the DAG. The post-intervention distribution P(Y | do(X = x)) is in general different from the conditional distribution P(Y | X = x). The do-calculus, introduced by Pearl in 1995, is a set of three rewriting rules that determine when an interventional expression can be rewritten in terms of observational quantities.
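The gap between conditioning and intervening can be seen in a small simulation. The sketch below uses a toy SCM invented for illustration (a confounder U that drives both X and Y, with a true effect of X on Y of 1.0); the function name and coefficients are assumptions, not part of any library.

```python
import numpy as np

def simulate(n=200_000, seed=0, do_x=None):
    """Sample from a toy SCM with edges U -> X, U -> Y and X -> Y.

    If do_x is given, the structural equation for X is replaced by the
    constant do_x, which mimics the graph surgery of do(X = do_x).
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                               # confounder
    if do_x is None:
        x = (u + rng.normal(size=n) > 0).astype(float)   # X listens to U
    else:
        x = np.full(n, float(do_x))                      # edge U -> X severed
    y = 1.0 * x + 2.0 * u + rng.normal(size=n)           # true effect of X is 1.0
    return x, y

x, y = simulate()
assoc = y[x == 1].mean() - y[x == 0].mean()   # E[Y | X=1] - E[Y | X=0]
_, y1 = simulate(do_x=1)
_, y0 = simulate(do_x=0)
interv = y1.mean() - y0.mean()                # E[Y | do(X=1)] - E[Y | do(X=0)]
print(f"associational contrast:  {assoc:.2f}")
print(f"interventional contrast: {interv:.2f}")
```

In this toy model the associational contrast lands well above 1.0 because units with X = 1 also tend to have high U, while the interventional contrast recovers the structural coefficient.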
The three rules concern, respectively, the insertion or deletion of observations, the exchange of an action for an observation, and the insertion or deletion of actions. Huang and Valtorta, and independently Shpitser and Pearl, proved in 2006 that the three rules are complete: if a causal effect is identifiable from the graph plus observational data at all, then a finite sequence of do-calculus rewrites will identify it.
Pearl and Mackenzie's 2018 popular book "The Book of Why" introduced the ladder of causation as a way to organise causal reasoning into three rungs:
| Rung | Activity | Question type | Example |
|---|---|---|---|
| 1. Association | Seeing | What is? | P(y \| x) |
| 2. Intervention | Doing | What if I do? | P(y \| do(x)) |
| 3. Counterfactual | Imagining | What if I had done? | P(y_x \| x', y') |
A model that supports rung 3 also supports rungs 1 and 2, but not vice versa. Most pure machine learning lives at rung 1.
The two frameworks are formally interchangeable. Pearl, Bareinboim, and others have shown that any quantity expressible in one can be expressed in the other, and that the same identification problems have the same answers in both. The choice between them is largely one of convenience: graphical models are usually clearer for reasoning about confounding structure and for designing identification strategies; potential outcomes are usually clearer for defining estimands and for finite-sample estimation. Hernán and Robins' textbook "Causal Inference: What If", freely available online, deliberately uses both and translates between them.
| Aspect | Potential outcomes | Structural causal models |
|---|---|---|
| Primary objects | Counterfactual random variables Y(t) | DAGs and structural equations |
| Intervention | Switching the realised treatment | do(X = x), graph mutilation |
| Strength for | Estimation, finite-sample inference | Identification, qualitative reasoning |
| Canonical references | Rubin 1974; Imbens and Rubin 2015 | Pearl 2009; Spirtes, Glymour, Scheines 2000 |
| Extra assumptions named | SUTVA, ignorability, positivity | Markov, faithfulness, causal sufficiency |
| Term | Meaning |
|---|---|
| Treatment / intervention | The variable whose causal effect is of interest, often binary |
| Outcome | The variable whose response to treatment is of interest |
| Confounder | A common cause of treatment and outcome that biases the naive comparison |
| Mediator | A variable on the causal path from treatment to outcome |
| Collider | A common effect of two variables; conditioning on it opens a non-causal path |
| Instrumental variable (IV) | A variable that influences the treatment, affects the outcome only through the treatment, and shares no unmeasured common cause with the outcome |
| Backdoor path | A non-causal path between treatment and outcome that starts with an arrow into the treatment |
| Backdoor criterion | A graphical condition on a covariate set Z: Z contains no descendant of the treatment and blocks every backdoor path. Adjusting for such a Z suffices to identify the causal effect |
| Front-door criterion | An alternative graphical condition that allows identification through a fully mediating variable, even when an unobserved confounder exists |
| ATE | Average treatment effect E[Y(1) − Y(0)] in the population |
| CATE | Conditional average treatment effect E[Y(1) − Y(0) \| X = x] |
| Counterfactual | The value of an outcome under a treatment that was not actually applied |
| Identifiability | Whether a causal estimand can be uniquely written as a function of the observable distribution under the assumptions |
| Estimation | The statistical task of constructing an estimator for an identified estimand from a finite sample |
| SUTVA | No interference across units, and a single well-defined version of each treatment |
| Positivity / overlap | Every covariate stratum has positive probability of receiving each treatment level |
| Ignorability | Conditional on covariates, treatment assignment is independent of potential outcomes |
A confounder, a mediator, and a collider can all be associated with both treatment and outcome, and they look identical in a contingency table. They are distinguished only by the causal structure that produced the data, which is exactly why writing down a DAG (or its potential-outcome equivalent) matters.
When randomisation is impossible or unethical, applied researchers rely on a small toolkit of identification strategies. Each one substitutes a specific structural assumption for the missing experimental control.
The randomised controlled trial (RCT) remains the gold standard. Random assignment forces treatment to be independent of potential outcomes by construction, so the simple difference in sample means is an unbiased estimator of the ATE. Everything else in observational causal inference can be read as an attempt to recover the conditions of a randomised trial when the data were not generated by one.
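As a minimal sketch of that estimator, the code below simulates a two-arm trial with a strong unobserved prognostic factor and computes the difference in means with a standard error. The data-generating process and the helper name `difference_in_means` are illustrative assumptions.

```python
import numpy as np

def difference_in_means(y, t):
    """Return the ATE estimate and its standard error for a two-arm trial."""
    y, t = np.asarray(y, float), np.asarray(t, bool)
    y1, y0 = y[t], y[~t]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return est, se

rng = np.random.default_rng(1)
n = 50_000
u = rng.normal(size=n)                       # prognostic factor the analyst never sees
t = rng.random(n) < 0.5                      # coin-flip assignment, independent of u
y = 2.0 * t + 3.0 * u + rng.normal(size=n)   # true ATE is 2.0
est, se = difference_in_means(y, t)
print(f"estimate {est:.2f} (standard error {se:.3f})")
```

Because assignment ignores `u`, the unobserved factor adds noise but no bias, so no adjustment is needed.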
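Given a sufficient adjustment set Z (one that satisfies the backdoor criterion), one can identify P(Y | do(X)) by adjusting for Z. Common implementations include outcome regression, stratification, propensity-score matching, inverse probability weighting (IPW), and doubly robust estimators such as AIPW and TMLE.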
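Two of the simplest implementations, standardisation over the strata of Z and inverse probability weighting, can be sketched for a single binary confounder. The simulated data and function names below are assumptions made for illustration; with a continuous or high-dimensional Z the per-stratum means would be replaced by fitted models.

```python
import numpy as np

def backdoor_ate(y, x, z):
    """Standardisation: sum over z of P(z) * (E[Y|X=1,z] - E[Y|X=0,z])."""
    ate = 0.0
    for value in np.unique(z):
        s = z == value
        ate += s.mean() * (y[s & (x == 1)].mean() - y[s & (x == 0)].mean())
    return ate

def ipw_ate(y, x, z):
    """Inverse probability weighting with a propensity estimated per stratum of z."""
    e = np.zeros(len(y))
    for value in np.unique(z):
        s = z == value
        e[s] = x[s].mean()                        # estimated P(X=1 | Z=value)
    return np.mean(x * y / e) - np.mean((1 - x) * y / (1 - e))

rng = np.random.default_rng(2)
n = 100_000
z = rng.integers(0, 2, size=n)                    # binary confounder
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)        # true ATE is 1.5
naive = y[x == 1].mean() - y[x == 0].mean()       # ignores z, so it is confounded
adjusted = backdoor_ate(y, x, z)
weighted = ipw_ate(y, x, z)
print(f"naive {naive:.2f}, standardised {adjusted:.2f}, IPW {weighted:.2f}")
```

With a propensity estimated separately in each stratum, the two adjusted estimators coincide algebraically; they diverge once the propensity or outcome model is smoothed across strata.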
When unmeasured confounders make the backdoor criterion impossible, an instrumental variable Z that affects Y only through X can still identify a causal effect. Imbens and Angrist showed in 1994 that under monotonicity, the IV estimand recovers the local average treatment effect (LATE) for the subpopulation of compliers. Angrist, Imbens, and Rubin's 1996 paper formalised the connection between IV and the potential outcomes framework.
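For a binary instrument, the IV estimand reduces to the Wald ratio of the reduced-form effect to the first-stage effect. The sketch below simulates a population of compliers, always-takers, and never-takers (no defiers, so monotonicity holds by construction); the proportions and effect sizes are illustrative assumptions.

```python
import numpy as np

def wald_estimator(y, x, z):
    """IV estimate for a binary instrument z: reduced form divided by first stage."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = x[z == 1].mean() - x[z == 0].mean()
    return reduced_form / first_stage

rng = np.random.default_rng(3)
n = 200_000
u = rng.normal(size=n)                            # unmeasured confounder
z = rng.integers(0, 2, size=n)                    # randomised encouragement
complier = rng.random(n) < 0.4                    # compliers take treatment iff encouraged
always = (~complier) & (u > 0.5)                  # always-takers, selected on u
x = np.where(complier, z, always.astype(int))     # everyone else never takes treatment
effect = np.where(complier, 2.0, 5.0)             # the compliers' effect is 2.0
y = effect * x + 3.0 * u + rng.normal(size=n)
late = wald_estimator(y, x, z)
naive = y[x == 1].mean() - y[x == 0].mean()
print(f"Wald estimate {late:.2f}, naive contrast {naive:.2f}")
```

The Wald ratio targets the compliers' effect of 2.0, not the always-takers' effect of 5.0, which is the sense in which the estimand is local.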
The regression discontinuity design (RDD) was introduced by Donald Thistlethwaite and Donald Campbell in 1960 to evaluate the effect of a National Merit Scholarship recognition on later academic outcomes. Treatment is assigned by whether a continuous running variable exceeds a known cutoff. Units just above and just below the cutoff are treated as locally comparable, and the discontinuity in the conditional expectation of Y at the cutoff identifies a local treatment effect.
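A common way to estimate the discontinuity is local linear regression on each side of the cutoff. The following is a minimal sketch for a sharp design with a fixed bandwidth and uniform kernel; the bandwidth value and simulated data are assumptions, and applied work typically chooses the bandwidth in a data-driven way.

```python
import numpy as np

def rdd_estimate(y, r, cutoff=0.0, bandwidth=0.5):
    """Sharp RDD: fit a line on each side of the cutoff and difference the intercepts."""
    def intercept_at_cutoff(mask):
        # Regress y on (1, r - cutoff); the intercept is the fitted value at the cutoff.
        design = np.column_stack([np.ones(mask.sum()), r[mask] - cutoff])
        coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        return coef[0]
    near = np.abs(r - cutoff) <= bandwidth
    above = near & (r >= cutoff)
    below = near & (r < cutoff)
    return intercept_at_cutoff(above) - intercept_at_cutoff(below)

rng = np.random.default_rng(4)
n = 50_000
r = rng.uniform(-1, 1, size=n)                    # running variable
treated = r >= 0.0                                # deterministic assignment at the cutoff
y = 0.8 * r + 1.2 * treated + rng.normal(scale=0.5, size=n)   # jump of 1.2 at r = 0
jump = rdd_estimate(y, r)
print(f"estimated jump at the cutoff: {jump:.2f}")
```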
Difference-in-differences (DiD) compares the change in outcome over time in a treated group with the change in a control group, under the parallel-trends assumption that the control's trend is what the treated group's trend would have been absent treatment. Card and Krueger's 1994 New Jersey minimum-wage study is a canonical application.
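In the simplest two-group, two-period case the estimator is a difference of four cell means. The sketch below simulates data in which the groups differ in level but share a time trend, so parallel trends holds by construction; all numbers are illustrative assumptions.

```python
import numpy as np

def did_estimate(y, treated_group, post):
    """2x2 difference-in-differences from group-by-period means."""
    def cell(g, p):
        return y[(treated_group == g) & (post == p)].mean()
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

rng = np.random.default_rng(5)
n = 40_000
g = rng.integers(0, 2, size=n)                    # 1 = treated group
p = rng.integers(0, 2, size=n)                    # 1 = post-treatment period
# Level gap of 3.0 between groups, common trend of 1.0, treatment effect of 0.7.
y = 3.0 * g + 1.0 * p + 0.7 * g * p + rng.normal(size=n)
att = did_estimate(y, g, p)
post_only = y[(g == 1) & (p == 1)].mean() - y[(g == 0) & (p == 1)].mean()
print(f"DiD estimate {att:.2f}, post-period comparison {post_only:.2f}")
```

The post-period comparison alone absorbs the permanent level gap between groups; differencing out the pre-period removes it.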
The synthetic control method, introduced by Alberto Abadie and Javier Gardeazabal in 2003 in their study of the economic cost of conflict in the Basque Country, and developed further in Abadie, Diamond, and Hainmueller's 2010 paper on California's Proposition 99 tobacco-control program, constructs a weighted combination of untreated units that approximates the pre-treatment trajectory of a single treated unit. The post-treatment difference between the treated unit and this synthetic counterpart is the estimated effect. The 2010 paper estimated that by 2000 annual per-capita cigarette sales in California were about 26 packs lower than they would have been without Proposition 99; it has since been cited many thousands of times.
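The weight-fitting step can be sketched as a constrained least-squares problem. The version below is a simplification that matches only on pre-treatment outcomes (the published method also matches on covariates), uses SciPy's SLSQP solver, and runs on simulated data; the function name and all numbers are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(donors_pre, treated_pre):
    """Non-negative weights summing to one that best reproduce the treated
    unit's pre-treatment path. donors_pre has shape (periods, donors)."""
    n_donors = donors_pre.shape[1]
    loss = lambda w: np.sum((donors_pre @ w - treated_pre) ** 2)
    result = minimize(
        loss,
        x0=np.full(n_donors, 1.0 / n_donors),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_donors,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return result.x

rng = np.random.default_rng(6)
periods, pre = 30, 20
trend = np.linspace(10.0, 13.0, periods)[:, None]
donors = trend + rng.normal(size=(periods, 4))     # four untreated units
treated = donors @ np.array([0.6, 0.4, 0.0, 0.0])  # treated unit tracks two donors
treated[pre:] -= 2.0                               # intervention lowers the outcome by 2.0
w = synthetic_control_weights(donors[:pre], treated[:pre])
gap = treated[pre:] - donors[pre:] @ w             # treated minus synthetic, post-period
effect = gap.mean()
print("weights:", np.round(w, 2), "estimated effect:", round(effect, 2))
```

The simplex constraint keeps the synthetic unit inside the convex hull of the donors, which avoids extrapolation beyond what the untreated units actually did.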
The front-door criterion, introduced by Pearl in the 1990s, identifies P(Y | do(X)) when there is no usable adjustment set, provided one can find a fully mediating variable M whose own backdoor paths are blockable. It is rarely applicable in practice but is theoretically important because it shows that observational identification is not exhausted by the backdoor criterion.
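For discrete variables, the resulting front-door adjustment formula can be written as follows, with lowercase letters denoting values of X, M, and Y and x' a summation index over values of X:

```latex
P\big(y \mid \mathrm{do}(x)\big)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The outer factor is the effect of X on M, which is unconfounded; the inner sum is the effect of M on Y, obtained by a backdoor adjustment for X.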
| Strategy | Key assumption | Estimand identified |
|---|---|---|
| Randomised trial | Random assignment | ATE |
| Backdoor adjustment | No unmeasured confounders given Z | ATE, CATE |
| Propensity matching / IPW | Ignorability and positivity | ATE, ATT |
| Doubly robust (AIPW, TMLE) | One of outcome or propensity model correct | ATE, CATE |
| Instrumental variable | Exclusion, relevance, monotonicity | LATE |
| Regression discontinuity | Continuity at the cutoff | Local ATE at cutoff |
| Difference-in-differences | Parallel trends | ATT in treated period |
| Synthetic control | Pre-treatment fit, no spillover | Effect on the treated unit |
| Front-door adjustment | Mediator with blockable backdoor | ATE |
Causal discovery is the task of learning the causal graph itself from data, rather than estimating an effect given a graph. Several broad families exist.
Constraint-based methods test conditional independence statements in the data and find graphs consistent with the resulting set of constraints. The PC algorithm of Peter Spirtes and Clark Glymour, introduced in 1991 and treated comprehensively in Spirtes, Glymour, and Scheines' 2000 book "Causation, Prediction, and Search", is the canonical example. PC assumes causal sufficiency: that there are no latent common causes among the measured variables. The Fast Causal Inference (FCI) algorithm, due to Spirtes in 1995, drops causal sufficiency and outputs a partial ancestral graph (PAG) whose edge marks express uncertainty introduced by possible latent confounders.
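The building block of such methods is a conditional independence test. For linear-Gaussian data a partial correlation is a common choice, and the sketch below shows the two patterns a constraint-based search relies on: a chain, where conditioning on the middle variable removes the dependence, and a collider, where conditioning creates one. The helper name and the simulated structures are illustrative assumptions, and this is a single test, not an implementation of PC.

```python
import numpy as np

def partial_correlation(a, b, given):
    """Correlation between a and b after linearly regressing out `given` from both."""
    design = np.column_stack([np.ones(len(a)), given])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

rng = np.random.default_rng(10)
n = 100_000

# Chain X -> Z -> Y: dependent marginally, independent given Z.
x = rng.normal(size=n)
z = x + rng.normal(size=n)
y = z + rng.normal(size=n)
chain_marginal = np.corrcoef(x, y)[0, 1]
chain_partial = partial_correlation(x, y, z)

# Collider A -> C <- B: independent marginally, dependent given C.
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(size=n)
collider_marginal = np.corrcoef(a, b)[0, 1]
collider_partial = partial_correlation(a, b, c)

print(f"chain:    marginal {chain_marginal:.2f}, given Z {chain_partial:.2f}")
print(f"collider: marginal {collider_marginal:.2f}, given C {collider_partial:.2f}")
```

A search procedure would drop the X–Y edge in the first case and orient the two arrows into C in the second.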
Score-based methods search the space of DAGs for one that maximises a model-fit score, usually a penalised likelihood such as the Bayesian Information Criterion. Greedy Equivalence Search (GES), due to David Maxwell Chickering in 2002, searches over equivalence classes rather than individual DAGs and is asymptotically consistent.
Continuous-optimisation methods reformulate DAG learning as a smooth optimisation problem. NOTEARS, by Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric Xing in 2018, encodes the acyclicity constraint as a smooth function of the weighted adjacency matrix, allowing standard gradient-based numerical solvers to be used. Subsequent work has extended the idea to nonlinear and neural-network parametrisations.
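The acyclicity function in the NOTEARS paper is h(W) = tr(exp(W ∘ W)) − d, where ∘ is the elementwise product and d the number of variables; it equals zero exactly when W describes a DAG. The sketch below evaluates h and its gradient on two small hand-built matrices. It covers only the constraint, not the full optimisation, and the example matrices are assumptions.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(w):
    """h(W) = tr(exp(W * W)) - d, with * the elementwise product."""
    d = w.shape[0]
    return np.trace(expm(w * w)) - d

def notears_acyclicity_grad(w):
    """Gradient of h with respect to W: exp(W * W) transposed, times 2W elementwise."""
    return expm(w * w).T * 2.0 * w

dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, -0.7],
                [0.0, 0.0, 0.0]])      # edges 0 -> 1 -> 2, no cycle
cyclic = dag.copy()
cyclic[2, 0] = 0.9                     # adds 2 -> 0, closing a cycle

print("h(dag)    =", notears_acyclicity(dag))
print("h(cyclic) =", notears_acyclicity(cyclic))
```

Because h is differentiable, the combinatorial constraint "W is a DAG" becomes the equality constraint h(W) = 0 in an otherwise ordinary continuous programme.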
Functional causal models exploit asymmetries in the data-generating process. LiNGAM, the Linear Non-Gaussian Acyclic Model of Shimizu, Hoyer, Hyvärinen, and Kerminen in 2006, identifies the full causal ordering of continuous variables under the assumption that the structural equations are linear and the noise terms are non-Gaussian, using independent component analysis.
For most of its history, causal inference treated the regression of Y on X (and similar steps) as nuisance pieces of a larger argument. Modern work has plugged in flexible machine-learning methods at exactly those points, while keeping the causal scaffolding intact.
The most active area is estimation of CATE functions. Causal forests, introduced by Stefan Wager and Susan Athey in their 2018 Journal of the American Statistical Association paper "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests", adapt Breiman's random forest to estimate CATE pointwise with a Gaussian asymptotic distribution and valid confidence intervals.
Meta-learners are recipes for turning any supervised regression algorithm into a CATE estimator. Künzel, Sekhon, Bickel, and Yu's 2019 PNAS paper "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning" formalised the S-, T-, and X-learners; together with the R-learner of Nie and Wager, these make up four widely used variants:
| Meta-learner | Idea | Best when |
|---|---|---|
| S-learner | Single model with treatment as a feature | CATE is approximately zero or smooth |
| T-learner | Two separate models for treated and control | Response surfaces differ substantially |
| X-learner | Imputes unit-level effects using the other group's outcome model, then regresses them on covariates | Treatment groups are imbalanced |
| R-learner | Robinson-style residual orthogonalisation (Nie and Wager 2021) | High-dimensional nuisance, want orthogonality |
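The first two recipes are short enough to sketch directly. The code below wraps a hypothetical base learner (ordinary least squares, chosen only to keep the example dependency-free) as an S-learner and a T-learner and applies both to simulated trial data whose true CATE is 1 + 2x. Function names and data are assumptions; in practice the base learner would be a flexible model.

```python
import numpy as np

def fit_linear(features, target):
    """Hypothetical base learner: OLS with an intercept, returned as a predict function."""
    with_const = lambda f: np.column_stack([np.ones(len(f)), f])
    coef, *_ = np.linalg.lstsq(with_const(features), target, rcond=None)
    return lambda f: with_const(f) @ coef

def s_learner(x, t, y, base=fit_linear):
    """One model on (x, t); CATE is the prediction at t=1 minus the prediction at t=0."""
    model = base(np.column_stack([x, t]), y)
    def cate(q):
        treated = np.column_stack([q, np.ones(len(q))])
        control = np.column_stack([q, np.zeros(len(q))])
        return model(treated) - model(control)
    return cate

def t_learner(x, t, y, base=fit_linear):
    """Separate models for treated and control units; CATE is their difference."""
    mu1 = base(x[t == 1], y[t == 1])
    mu0 = base(x[t == 0], y[t == 0])
    return lambda q: mu1(q) - mu0(q)

rng = np.random.default_rng(7)
n = 20_000
x = rng.normal(size=(n, 1))
t = rng.integers(0, 2, size=n)                                 # randomised treatment
y = x[:, 0] + t * (1.0 + 2.0 * x[:, 0]) + rng.normal(size=n)   # true CATE is 1 + 2x
query = np.array([[-1.0], [1.0]])
cate_s = s_learner(x, t, y)(query)
cate_t = t_learner(x, t, y)(query)
print("S-learner:", np.round(cate_s, 2), "T-learner:", np.round(cate_t, 2))
```

With a linear base learner and no interaction term, the S-learner can only return a constant effect, while the T-learner recovers the heterogeneity. This is one concrete reason the choice of meta-learner interacts with the choice of base model.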
Bayesian causal forests, due to P. Richard Hahn, Jared Murray, and Carlos Carvalho in their 2020 Bayesian Analysis paper, combine Bayesian inference with regression-tree ensembles and have become a default in many policy applications.
Doubly robust estimators have been generalised to the CATE setting in the form of doubly robust learners (DRL) and double machine learning (DML, Chernozhukov and colleagues 2018). Targeted Minimum Loss-based Estimation (TMLE), developed by Mark van der Laan and colleagues at Berkeley and consolidated in van der Laan and Rose's 2011 book "Targeted Learning", is a closely related class of methods that perform a final "targeting" step to remove plug-in bias while keeping the plug-in property of respecting outcome bounds.
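The double-robustness property is easiest to see in the augmented IPW (AIPW) estimator of the ATE, which adds a propensity-weighted residual correction to an outcome-model prediction. The sketch below plugs in oracle and deliberately wrong nuisance values rather than fitted models, which is an assumption made to isolate the property; the simulated data are also illustrative.

```python
import numpy as np

def aipw_ate(y, t, mu1, mu0, e):
    """Augmented IPW estimate of the ATE and its standard error.

    mu1, mu0: outcome-model predictions under treatment and control.
    e:        propensity scores P(T=1 | covariates).
    """
    psi = (mu1 - mu0
           + t * (y - mu1) / e
           - (1 - t) * (y - mu0) / (1 - e))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(y))

rng = np.random.default_rng(8)
n = 200_000
z = rng.integers(0, 2, size=n)                    # binary confounder
e_true = np.where(z == 1, 0.8, 0.2)
t = (rng.random(n) < e_true).astype(float)
y = 1.5 * t + 2.0 * z + rng.normal(size=n)        # true ATE is 1.5
mu1_true, mu0_true = 1.5 + 2.0 * z, 2.0 * z
zeros, half = np.zeros(n), np.full(n, 0.5)        # deliberately wrong nuisances

good_outcome, _ = aipw_ate(y, t, mu1_true, mu0_true, half)   # wrong propensity
good_propensity, _ = aipw_ate(y, t, zeros, zeros, e_true)    # wrong outcome model
both_wrong, _ = aipw_ate(y, t, zeros, zeros, half)
print(f"{good_outcome:.2f} {good_propensity:.2f} {both_wrong:.2f}")
```

Either correct nuisance model is enough to land near 1.5; only when both are wrong does the confounding bias reappear.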
A newer line of work asks where the causal variables themselves come from. In real perception problems, the raw observations are pixels or sensor readings, not abstract "smoking" and "cancer" nodes. Causal representation learning aims to extract latent variables from such observations that admit a causal model. Schölkopf, Locatello, Bauer, Ke, Kalchbrenner, Goyal, and Bengio's 2021 Proceedings of the IEEE article "Toward Causal Representation Learning" is the standard reference and ties the agenda to issues of transfer, generalisation, and robustness in deep neural network systems.
The Causal Effect Variational Autoencoder (CEVAE), introduced by Christos Louizos and colleagues at NeurIPS 2017, treats unobserved confounders as latent variables in a variational autoencoder and recovers treatment effects from noisy proxies of the confounders.
The relationship between large language models and causal reasoning is an active and unsettled research area. LLMs can recite definitions of confounders and parse causal questions, but their ability to perform formal identification or to generalise to new causal structures is limited and the empirical results are mixed. The general lesson, repeated across the causal inference literature since Pearl, is that causal answers require causal assumptions, and a system trained purely to predict the next token has no obvious mechanism for surfacing those assumptions.
Many of the fairness criteria that look interesting have a counterfactual flavour. Counterfactual fairness, defined by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva in their 2017 NeurIPS paper of the same name, requires that a decision is the same in the actual world and in a counterfactual world where the protected attribute is changed and everything causally upstream is held fixed. Path-specific effects refine this by allowing some causal pathways from the protected attribute to the outcome (such as legitimate qualifications) while forbidding others (such as direct discrimination). See algorithmic fairness, bias (ethics/fairness), counterfactual fairness, individual fairness, fairness constraint, fairness metric, fairness terms, incompatibility of fairness metrics, machine learning terms/fairness, unawareness (fairness through unawareness), and experimenter's bias for related entries.
Reinforcement learning and causal inference share their core question: what is the value of acting differently? Off-policy evaluation (OPE) estimates the expected return of a target policy from data collected under a different behaviour policy, exactly the situation that causal inference was designed for. Importance-sampling estimators reweight observed trajectories by the ratio of target to behaviour policy probabilities, which is the sequential analogue of inverse probability weighting. Doubly robust and marginalised importance-sampling estimators carry over from the static causal inference literature and reduce the exponential variance of naive estimators in long horizons.
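In the single-step (bandit) case the importance-sampling estimator is a weighted average of logged rewards. The sketch below uses a three-armed bandit with made-up policies and reward means; in the sequential case the weight becomes a product of per-step ratios along each trajectory, which is where the variance blow-up comes from.

```python
import numpy as np

def importance_sampling_value(actions, rewards, behaviour, target):
    """Estimate the target policy's expected reward from data logged under `behaviour`."""
    weights = target[actions] / behaviour[actions]   # ratio of action probabilities
    return np.mean(weights * rewards)

rng = np.random.default_rng(11)
n = 100_000
behaviour = np.array([0.6, 0.3, 0.1])       # logging policy over three actions
target = np.array([0.1, 0.2, 0.7])          # policy we want to evaluate
mean_reward = np.array([1.0, 2.0, 3.0])     # unknown to the estimator

actions = rng.choice(3, size=n, p=behaviour)
rewards = mean_reward[actions] + rng.normal(scale=0.5, size=n)

estimate = importance_sampling_value(actions, rewards, behaviour, target)
true_value = float(target @ mean_reward)    # 2.6 in this toy setup
logged_average = rewards.mean()             # value of the behaviour policy, about 1.5
print(f"IS estimate {estimate:.2f}, truth {true_value:.2f}, logged mean {logged_average:.2f}")
```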
| Library | Language | Maintainer | Strength |
|---|---|---|---|
| DoWhy | Python | Microsoft Research (Sharma and Kiciman) | Unified four-step API; strong on assumption testing and refutation |
| EconML | Python | Microsoft ALICE | State-of-the-art CATE estimators (DML, doubly robust, deep IV) |
| CausalML | Python | Uber | Uplift modelling, meta-learners, neural-network estimators |
| causalgraphicalmodels | Python | (open source) | Pedagogical DAG manipulation and identifiability |
| grf | R | Athey, Tibshirani, Wager | Generalized random forests including causal forest |
| tmle | R | van der Laan group | Targeted maximum likelihood implementations |
| pcalg | R | Kalisch, Mächler, Maathuis et al. | PC, FCI, GES and other discovery algorithms |
| Pyro CEVAE | Python | Uber AI Labs | Reference CEVAE implementation in a probabilistic programming language |
DoWhy, EconML, and CausalML are commonly used together in production pipelines: DoWhy for stating and refuting the causal assumptions, EconML or CausalML for the heavy-machinery estimation step.
Causal inference is unforgiving in a way that prediction is not. A model with bad predictions tells on itself in held-out error; a causal estimate built on the wrong assumptions can be confidently wrong with no internal warning sign.
Garbage in, garbage out. Causal claims are only as good as the assumptions they rest on, and most of those assumptions (no unmeasured confounders, positivity, SUTVA, parallel trends, exclusion restrictions) are untestable from the data alone. Sensitivity analyses, placebo tests, and cross-design replication exist precisely because point estimates without a credibility argument do not deserve to be believed. Athey and Imbens's 2017 Journal of Economic Perspectives review "The State of Applied Econometrics" is largely organised around this point.
Spurious correlations. A famous teaching example is Franz Messerli's 2012 New England Journal of Medicine note showing a strong national correlation between per-capita chocolate consumption and Nobel laureates per capita. The correlation is real; the causal story ("chocolate makes you smarter") is a joke that more than a few news outlets failed to get. The correlation is driven by national wealth, education systems, and many other common causes.
Simpson's paradox. Two variables can be positively associated overall yet negatively associated in every subgroup when a third variable is related to both. The 1973 University of California, Berkeley graduate admissions case is the canonical example: the aggregate acceptance rate looked biased against women, but within most departments the rate was similar or slightly higher for women, because women applied disproportionately to the more competitive departments. Resolving the paradox requires deciding which causal model corresponds to the question being asked, and the resolution is different for different questions.
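The arithmetic of the reversal is easy to reproduce. The counts below are hypothetical, chosen to produce the pattern; they are not the actual Berkeley figures.

```python
# Hypothetical (applicants, admitted) counts for two departments.
data = {
    "A": {"men": (800, 480), "women": (100, 65)},    # less competitive department
    "B": {"men": (200, 40), "women": (900, 225)},    # more competitive department
}

def rate(applied, admitted):
    return admitted / applied

def overall(group):
    """Admission rate for a group pooled across departments."""
    applied = sum(data[d][group][0] for d in data)
    admitted = sum(data[d][group][1] for d in data)
    return rate(applied, admitted)

for dept in data:
    print(dept, {g: rate(*data[dept][g]) for g in ("men", "women")})
print("overall", {g: overall(g) for g in ("men", "women")})
```

Women have the higher rate in each department (65% against 60%, and 25% against 20%) but the lower pooled rate (29% against 52%), because most women in this example applied to department B.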
Berkson's paradox and collider bias. Conditioning on a common effect of two variables creates an association between them, even if they were originally independent. Hospital-based studies that condition implicitly on being sick enough to be admitted, dating-app data that conditions on appearing in a swipe pool, and academic studies that condition on having published produce non-causal correlations of exactly this kind.
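A few lines of simulation make the mechanism concrete. In the sketch below, two independent traits (labelled talent and luck purely for illustration) jointly determine selection into the sample, and the correlation is computed before and after conditioning on selection.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
talent = rng.normal(size=n)
luck = rng.normal(size=n)                 # independent of talent by construction
published = talent + luck > 1.0           # collider: a common effect of both

corr_all = np.corrcoef(talent, luck)[0, 1]
corr_selected = np.corrcoef(talent[published], luck[published])[0, 1]
print(f"whole population: {corr_all:.2f}, published only: {corr_selected:.2f}")
```

Among the selected units a low value on one trait must be compensated by a high value on the other, which induces a negative correlation that does not exist in the full population.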
Pearl's critique of pure deep learning. Pearl has repeatedly argued, including in "The Book of Why" and in subsequent writing, that systems that operate purely at the associational rung will not deliver causal answers no matter how much data they consume. The critique is not a denial of deep learning's value at rung 1; it is an argument that adding rungs 2 and 3 requires extra structure, and that pretending otherwise leads to confidently wrong policy advice. Whether and how this structure can be learned, rather than imposed, is one of the questions that animates causal representation learning today.