Causal inference

Updated Jul 11, 2026

Causal inference is the field of study concerned with drawing conclusions about cause-and-effect relationships from data, answering questions of the form "what would happen to outcome Y if we intervened on variable X?" rather than the weaker statistical question "how does Y vary with X across observed cases?" Its central technical task is to bridge the gap between an associational quantity, written $P(Y \mid X = x)$, which can be read off any joint distribution, and an interventional or counterfactual quantity, written $P(Y \mid \mathrm{do}(X = x))$, which generally cannot.^[1] The two differ in general: the difference in coffee drinking between people who do and do not have lung cancer is not the effect of drinking coffee on cancer, because a common cause such as smoking can be associated with both.

The field combines work from statistics, economics, epidemiology, computer science, and philosophy. Its modern form owes most to two mathematical traditions that grew up in parallel during the twentieth century: the potential outcomes framework associated with Jerzy Neyman (1923) and Donald Rubin (1974), and the structural causal model framework associated with Judea Pearl.^[3]^[4]^[1] Both traditions are now used routinely in randomised trials, observational studies, policy evaluation, algorithmic fairness, and increasingly in machine learning.

Why does causal inference matter for AI and machine learning?

Most predictive regression and classification systems are trained to minimise loss on a fixed joint distribution between inputs and outputs. They learn statistical associations, not mechanisms. This is fine when the deployment distribution matches the training distribution and when nobody intervenes on the inputs, but it breaks in three situations that come up constantly in practice.

The first is distribution shift. A model that learned that snow co-occurs with husky photographs will misclassify a husky on grass, because the spurious feature it relied on is no longer present. This is not a thought experiment: the LIME paper by Ribeiro, Singh, and Guestrin (2016) showed a classifier that distinguished wolves from huskies almost entirely on the presence of snow in the background.^[35] Causal models, which separate the data-generating mechanism from the distribution it happens to produce, are designed to be more robust to such shifts. Scholkopf and colleagues argue in their 2021 review "Toward Causal Representation Learning" that this is one of the strongest motivations for taking causality seriously in modern machine learning, reviewing "fundamental concepts of causal inference" and "thereby assaying how causality can contribute to modern machine learning research."^[20]

The second is intervention. If a credit scoring model is used to decide who gets a loan, and the bank changes its policy on the basis of model predictions, the joint distribution between features and repayment changes. Predictive accuracy on historical data tells you almost nothing about the effect of the new policy. You need a model of what changes when you intervene.

The third is fairness reasoning. Many definitions of fairness are inherently counterfactual: would the same applicant have received the same decision if their race had been different, holding everything causally upstream of race fixed? See counterfactual fairness and individual fairness for two closely related approaches.

Two main frameworks

Causal inference is dominated by two formalisms that look superficially different but turn out to express the same content.

Potential outcomes (Neyman-Rubin causal model)

The potential outcomes framework, also called the Neyman-Rubin causal model, was first introduced by Jerzy Neyman in his 1923 master's thesis on randomised agricultural experiments and generalised by Donald Rubin in 1974 to observational studies.^[4]^[3] For each unit i and each treatment level t, the model posits a potential outcome $Y_i(t)$ that would have been observed if unit i had received treatment t. For binary treatment we write $Y_i(1)$ and $Y_i(0)$.

The individual treatment effect is the contrast $Y_i(1) - Y_i(0)$. The fundamental problem of causal inference, in Paul Holland's well-known 1986 phrase, is that for any given unit we observe only one of the two potential outcomes.^[31] The other is a missing counterfactual, which is why the framework is often described as a missing-data problem.

Identification of population quantities therefore requires extra assumptions. The most commonly used population estimand is the average treatment effect $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$. Variants include the average treatment effect on the treated (ATT), the average treatment effect on the controls (ATC), and the conditional average treatment effect $\text{CATE}(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$, which captures heterogeneity across covariates.

Under the assumptions of stable unit treatment values (SUTVA), unconfoundedness (also called conditional ignorability or strong ignorability), and positivity, the ATE is identified from observational data by adjusting for covariates. Imbens and Rubin's 2015 textbook "Causal Inference for Statistics, Social, and Biomedical Sciences" is the standard reference for this tradition.^[5]
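The identification argument can be written out in a few lines. The sketch below uses T for the binary treatment indicator and X for the covariates; the symbol T is introduced here for illustration and is not part of the notation above.

$$
\begin{aligned}
\mathbb{E}[Y(1)] &= \mathbb{E}_X\big[\mathbb{E}[Y(1) \mid X]\big] && \text{(iterated expectations)} \\
&= \mathbb{E}_X\big[\mathbb{E}[Y(1) \mid T = 1, X]\big] && \text{(unconfoundedness, with positivity so the conditioning is defined)} \\
&= \mathbb{E}_X\big[\mathbb{E}[Y \mid T = 1, X]\big] && \text{(SUTVA, so that } Y = Y(T)\text{)}
\end{aligned}
$$

The same steps apply to $\mathbb{E}[Y(0)]$, which gives $\text{ATE} = \mathbb{E}_X\big[\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\big]$, an expression that involves only observable quantities.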

Structural causal models (Pearl's approach)

The structural causal model (SCM) approach, developed largely by Judea Pearl in the 1990s and consolidated in his 2009 textbook "Causality: Models, Reasoning, and Inference" (second edition, Cambridge University Press), uses directed acyclic graphs (DAGs) over variables together with structural equations.^[1] Each node represents a variable; each directed edge represents a direct causal influence; each variable is determined by a deterministic function of its parents and an exogenous noise term. The same DAG machinery underpins Bayesian networks, but a structural causal model adds the interventional semantics that a purely probabilistic Bayesian network lacks.

Intervention is represented by the do-operator. The expression $\mathrm{do}(X = x)$ denotes the operation of setting X to value x by external action, severing the incoming edges to X in the DAG. The post-intervention distribution $P(Y \mid \mathrm{do}(X = x))$ is in general different from the conditional distribution $P(Y \mid X = x)$. The do-calculus, introduced by Pearl in 1995, is a set of three rewriting rules that determine when an interventional expression can be rewritten in terms of observational quantities.^[1]
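The gap between conditioning and intervening can be checked by exact enumeration on a tiny model. The following is a minimal sketch, assuming a hypothetical three-variable SCM (Z causes X and Y, X causes Y) with made-up probabilities; the names `P_Z`, `P_X1_GIVEN_Z`, and `P_Y1_GIVEN_XZ` are illustrative only.

```python
from itertools import product

# Hypothetical SCM over binary variables: Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.5, 1: 0.5}                       # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y = 1 | X = x, Z = z), keyed by (x, z)
                 (0, 1): 0.5, (1, 1): 0.7}


def joint(z, x, y):
    """Observational joint probability P(Z = z, X = x, Y = y)."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def p_y1_given_x(x):
    """Associational quantity P(Y = 1 | X = x)."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x):
    """Interventional quantity P(Y = 1 | do(X = x)).

    Severing the edge Z -> X removes the factor P(x | z) and leaves
    P(z) * P(y | x, z), summed over z.
    """
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


associational_gap = p_y1_given_x(1) - p_y1_given_x(0)   # 0.62 - 0.18 = 0.44
causal_effect = p_y1_do_x(1) - p_y1_do_x(0)             # 0.50 - 0.30 = 0.20
```

In this toy model the observed contrast is more than twice the interventional one, because Z raises both the chance of treatment and the chance of the outcome.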

The three rules concern, respectively, the insertion or deletion of observations, the exchange of an action for an observation, and the insertion or deletion of actions. Huang and Valtorta, and independently Shpitser and Pearl, proved in 2006 that the three rules are complete: if a causal effect is identifiable from the graph plus observational data at all, then a finite sequence of do-calculus rewrites will identify it, and if the rules cannot eliminate the do-operator then the effect is not identifiable.^[29]^[30]

Pearl and Mackenzie's 2018 popular book "The Book of Why" introduced the ladder of causation as a way to organise causal reasoning into three rungs.^[2] Pearl uses the ladder to argue that data alone are insufficient for causal questions: "We live in an era that presumes Big Data to be the solution to all our problems, but I hope with this book to convince you that data are profoundly dumb."^[2]

Rung	Activity	Question type	Example
1. Association	Seeing	What is? $P(y \mid x)$	How are symptoms and disease related?
2. Intervention	Doing	What if I do? $P(y \mid \mathrm{do}(x))$	What if I take aspirin?
3. Counterfactual	Imagining	What if I had done? $P(y_x \mid x', y')$	Would the patient have lived if she had taken the drug?

A model that supports rung 3 also supports rungs 1 and 2, but not vice versa. Most pure machine learning lives at rung 1.

How do the two frameworks differ?

The two frameworks are formally interchangeable. Pearl, Bareinboim, and others have shown that any quantity expressible in one can be expressed in the other, and that the same identification problems have the same answers in both. The choice between them is largely one of convenience: graphical models are usually clearer for reasoning about confounding structure and for designing identification strategies; potential outcomes are usually clearer for defining estimands and for finite-sample estimation. Hernan and Robins' textbook "Causal Inference: What If", freely available online, deliberately uses both and translates between them.^[6]

Aspect	Potential outcomes	Structural causal models
Primary objects	Counterfactual random variables Y(t)	DAGs and structural equations
Intervention	Switching the realised treatment	do(X = x), graph mutilation
Strength for	Estimation, finite-sample inference	Identification, qualitative reasoning
Canonical references	Rubin 1974; Imbens and Rubin 2015	Pearl 2009; Spirtes, Glymour, Scheines 2000
Extra assumptions named	SUTVA, ignorability, positivity	Markov, faithfulness, causal sufficiency

Key concepts

Term	Meaning
Treatment / intervention	The variable whose causal effect is of interest, often binary
Outcome	The variable whose response to treatment is of interest
Confounder	A common cause of treatment and outcome that biases the naive comparison
Mediator	A variable on the causal path from treatment to outcome
Collider	A common effect of two variables; conditioning on it opens a non-causal path
Instrumental variable (IV)	A variable that influences the treatment, affects the outcome only through the treatment, and shares no unmeasured common cause with the outcome
Backdoor path	A non-causal path between treatment and outcome that starts with an arrow into the treatment
Backdoor criterion	A graphical condition on a covariate set Z: Z contains no descendant of the treatment and blocks every backdoor path, so adjusting for Z identifies the causal effect
Frontdoor criterion	An alternative graphical condition that allows identification through a fully mediating variable, even when an unobserved confounder exists
ATE	Average treatment effect $\mathbb{E}[Y(1) - Y(0)]$ in the population
CATE	Conditional average treatment effect $\mathbb{E}[Y(1) - Y(0) \mid X = x]$
Counterfactual	The value of an outcome under a treatment that was not actually applied
Identifiability	Whether a causal estimand can be uniquely written as a function of the observable distribution under the assumptions
Estimation	The statistical task of constructing an estimator for an identified estimand from a finite sample
SUTVA	No interference across units, and a single well-defined version of each treatment
Positivity / overlap	Every covariate stratum has positive probability of receiving each treatment level
Ignorability	Conditional on covariates, treatment assignment is independent of potential outcomes

A confounder, a mediator, and a collider can all be associated with both treatment and outcome, and they look identical in a contingency table. They are distinguished only by the causal structure that produced the data, which is exactly why writing down a DAG (or its potential-outcome equivalent) matters.

What identification strategies are used in observational studies?

When randomisation is impossible or unethical, applied researchers rely on a small toolkit of identification strategies. Each one substitutes a specific structural assumption for the missing experimental control.

Randomised controlled trial

The randomised controlled trial (RCT) remains the gold standard. Random assignment forces treatment to be independent of potential outcomes by construction, so the simple difference in sample means is an unbiased estimator of the ATE. Everything else in observational causal inference can be read as an attempt to recover the conditions of a randomised trial when the data were not generated by one.
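A short simulation illustrates why randomisation works. This is a sketch under assumed values: the data-generating process (a constant effect of 2.0, a binary prognostic covariate, unit-variance noise) and the function names are hypothetical.

```python
import random


def simulate_rct(n, seed=0):
    """Return (treated_outcomes, control_outcomes) from a simulated trial."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.random() < 0.5                # prognostic covariate
        y0 = 3.0 * z + rng.gauss(0.0, 1.0)    # potential outcome without treatment
        y1 = y0 + 2.0                         # potential outcome with treatment
        if rng.random() < 0.5:                # coin flip, independent of (y0, y1)
            treated.append(y1)                # only y1 is observed for this unit
        else:
            control.append(y0)                # only y0 is observed for this unit
    return treated, control


def difference_in_means(treated, control):
    """Unadjusted estimator of the average treatment effect."""
    return sum(treated) / len(treated) - sum(control) / len(control)


treated, control = simulate_rct(20000)
ate_hat = difference_in_means(treated, control)   # close to the true effect of 2.0
```

Each unit reveals only one potential outcome, yet the two arms are comparable on average because the coin flip ignores everything about the unit.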

Adjustment for confounders

Given a sufficient adjustment set Z (one that satisfies the backdoor criterion), one can identify $P(Y \mid \mathrm{do}(X))$ by adjusting for Z. Common implementations include:

Outcome regression, using for example linear regression, logistic regression, ridge regression, lasso regression, or multinomial (multi-class logistic) regression. The conditional mean $\mathbb{E}[Y \mid X, Z]$ is fit, predicted for every unit at each treatment level of interest, and then averaged over the empirical distribution of Z.
Propensity-score matching, where each treated unit is matched to one or more controls with similar estimated probability of treatment given covariates.
Inverse probability weighting (IPW), where each unit is reweighted by the inverse of its estimated treatment probability so the weighted distribution mimics what randomisation would have produced.
Doubly robust estimators, including the augmented inverse probability weighting (AIPW) estimator, which combine an outcome model and a propensity model and remain consistent if either model is correctly specified.
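The weighting and doubly robust estimators listed above can be sketched on simulated data. The example assumes a hypothetical process with one binary confounder z and a true effect of 2.0, and it estimates both nuisance models by simple cell averages; none of these choices come from a particular library.

```python
import random


def simulate(n, seed=0):
    """Rows of (z, t, y) with z confounding treatment t and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0
        y = 2.0 * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def estimate(rows):
    """Return (naive, ipw, aipw) estimates of the average treatment effect."""
    # Propensity model e(z) = P(T = 1 | Z = z), estimated by stratum frequencies.
    e = {z: mean(t for zz, t, _ in rows if zz == z) for z in (0, 1)}
    # Outcome model mu[(t, z)] = E[Y | T = t, Z = z], estimated by cell means.
    mu = {(t, z): mean(y for zz, tt, y in rows if zz == z and tt == t)
          for t in (0, 1) for z in (0, 1)}

    naive = (mean(y for _, t, y in rows if t == 1)
             - mean(y for _, t, y in rows if t == 0))
    ipw = mean(t * y / e[z] - (1 - t) * y / (1 - e[z]) for z, t, y in rows)
    aipw = mean(
        mu[(1, z)] - mu[(0, z)]
        + t * (y - mu[(1, z)]) / e[z]
        - (1 - t) * (y - mu[(0, z)]) / (1 - e[z])
        for z, t, y in rows
    )
    return naive, ipw, aipw


naive, ipw, aipw = estimate(simulate(20000))
```

The naive contrast lands near 3.8 rather than 2.0 because treated units are mostly z = 1. With cell-average nuisance models the IPW and AIPW estimates essentially coincide; they would differ once the propensity or outcome model is smoothed or misspecified, which is where double robustness matters.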

Instrumental variables

When unmeasured confounders make the backdoor criterion impossible to satisfy, an instrumental variable Z that influences X, affects Y only through X, and is itself unconfounded with Y can still identify a causal effect. Imbens and Angrist showed in 1994 that under monotonicity, the IV estimand recovers the local average treatment effect (LATE) for the subpopulation of compliers.^[23] Angrist, Imbens, and Rubin's 1996 paper formalised the connection between IV and the potential outcomes framework.^[24] Joshua Angrist and Guido Imbens shared one half of the 2021 Nobel Memorial Prize in Economic Sciences for this methodological work on causal relationships; the other half went to David Card.
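The basic IV ratio can be illustrated with a simulation. This sketch assumes a hypothetical linear model with a constant effect of 2.0, so the ratio recovers that effect directly; the complier-specific LATE interpretation only becomes relevant when effects vary across units. All names and coefficients are illustrative.

```python
import random


def cov(a, b):
    """Sample covariance (population normalisation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def simulate_iv(n, seed=0):
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)                       # unmeasured confounder
        zi = 1.0 if rng.random() < 0.5 else 0.0       # instrument, independent of u
        xi = zi + u + rng.gauss(0.0, 1.0)             # treatment depends on z and u
        yi = 2.0 * xi + 3.0 * u + rng.gauss(0.0, 1.0) # z affects y only through x
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def ols_slope(x, y):
    """Regression of y on x; biased here because u is omitted."""
    return cov(x, y) / cov(x, x)


def iv_ratio(z, x, y):
    """Ratio of the instrument's effect on y to its effect on x."""
    return cov(z, y) / cov(z, x)


z, x, y = simulate_iv(50000)
biased = ols_slope(x, y)      # pulled upward by the confounder
iv_hat = iv_ratio(z, x, y)    # close to the true effect of 2.0
```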

Regression discontinuity design

The regression discontinuity design (RDD) was introduced by Donald Thistlethwaite and Donald Campbell in 1960 to evaluate the effect of a National Merit Scholarship recognition on later academic outcomes.^[25] Treatment is assigned by whether a continuous running variable exceeds a known cutoff. Units just above and just below the cutoff are treated as locally comparable, and the discontinuity in the conditional expectation of Y at the cutoff identifies a local treatment effect.
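One common way to estimate the jump is to fit a separate line on each side of the cutoff and compare the fitted values at the cutoff. The sketch below assumes a hypothetical process with a cutoff at 0, a true jump of 2.0, and an arbitrary bandwidth of 0.5; in practice the bandwidth choice needs its own justification.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def simulate_rdd(n, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        r = rng.uniform(-1.0, 1.0)        # running variable, cutoff at 0
        treated = r >= 0.0                # deterministic assignment rule
        y = 1.0 + 0.5 * r + 2.0 * treated + rng.gauss(0.0, 0.5)
        data.append((r, y))
    return data


def rdd_estimate(data, bandwidth=0.5):
    """Difference of the two fitted intercepts at the cutoff r = 0."""
    left = [(r, y) for r, y in data if -bandwidth <= r < 0.0]
    right = [(r, y) for r, y in data if 0.0 <= r <= bandwidth]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left


jump = rdd_estimate(simulate_rdd(20000))   # close to the true jump of 2.0
```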

Difference-in-differences

Difference-in-differences (DiD) compares the change in outcome over time in a treated group with the change in a control group, under the parallel-trends assumption that the control's trend is what the treated group's trend would have been absent treatment. Card and Krueger's 1994 New Jersey minimum-wage study is a canonical application.^[26]
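In the simplest two-group, two-period case the estimator is a single line of arithmetic. The group means below are hypothetical numbers chosen for illustration, not figures from any study.

```python
def did(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre   # stands in for the treated group's untreated trend
    return treated_change - control_change


# Hypothetical group means.
estimate = did(treated_pre=10.0, treated_post=15.0,
               control_pre=8.0, control_post=10.0)   # (15 - 10) - (10 - 8) = 3.0
```

The level gap between the groups before treatment (10 versus 8) does not bias the estimate; only a difference in trends would.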

Synthetic control method

The synthetic control method, introduced by Alberto Abadie and Javier Gardeazabal in 2003 in their study of the economic cost of conflict in the Basque Country, and developed further in Abadie, Diamond, and Hainmueller's 2010 paper on California's Proposition 99 tobacco-control program, constructs a weighted combination of untreated units that approximates the pre-treatment trajectory of a single treated unit.^[27]^[28] The post-treatment difference is the estimated effect. The 2010 paper estimated that by 2000, annual per-capita cigarette sales in California were about 26 packs lower than they would have been without Proposition 99.^[28]

Front-door adjustment

The front-door criterion, proved by Pearl in the 1990s, identifies $P(Y \mid \mathrm{do}(X))$ when there is no usable adjustment set, provided one can find a fully mediating variable M whose own backdoor paths are blockable.^[1] It is rarely applicable in practice but is theoretically important because it shows that observational identification is not exhausted by the backdoor criterion.
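The front-door formula, $P(y \mid \mathrm{do}(x)) = \sum_m P(m \mid x) \sum_{x'} P(y \mid m, x') P(x')$, can be verified by exact enumeration. The sketch below assumes a hypothetical binary model with an unobserved confounder U of X and Y and a mediator M; all probabilities and names are made up for illustration.

```python
from itertools import product

# Hypothetical SCM: U -> X, U -> Y, X -> M, M -> Y, with U unobserved.
P_U1 = 0.5
P_X1_U = {0: 0.2, 1: 0.8}                    # P(X = 1 | U = u)
P_M1_X = {0: 0.1, 1: 0.9}                    # P(M = 1 | X = x)
P_Y1_MU = {(0, 0): 0.1, (1, 0): 0.5,         # P(Y = 1 | M = m, U = u), keyed by (m, u)
           (0, 1): 0.4, (1, 1): 0.8}


def bern(p1, value):
    return p1 if value == 1 else 1.0 - p1


# Observational joint over the observed variables only (U summed out).
OBS = {}
for x, m, y in product((0, 1), repeat=3):
    OBS[(x, m, y)] = sum(
        bern(P_U1, u) * bern(P_X1_U[u], x) * bern(P_M1_X[x], m) * bern(P_Y1_MU[(m, u)], y)
        for u in (0, 1)
    )


def prob(**fixed):
    """Probability of a partial assignment to x, m, y under OBS."""
    total = 0.0
    for (x, m, y), p in OBS.items():
        values = {"x": x, "m": m, "y": y}
        if all(values[k] == v for k, v in fixed.items()):
            total += p
    return total


def front_door(x):
    """P(Y = 1 | do(X = x)) computed from observational quantities only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = sum(prob(x=xp, m=m, y=1) / prob(x=xp, m=m) * prob(x=xp) for xp in (0, 1))
        total += p_m_given_x * inner
    return total


def truth(x):
    """P(Y = 1 | do(X = x)) computed from the full SCM, including U."""
    return sum(
        bern(P_M1_X[x], m) * sum(bern(P_U1, u) * P_Y1_MU[(m, u)] for u in (0, 1))
        for m in (0, 1)
    )
```

Here `front_door(1)` and `truth(1)` both evaluate to 0.61, while the plain conditional `prob(x=1, y=1) / prob(x=1)` is 0.70, so the formula removes the confounding without ever observing U.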

Strategy	Key assumption	Estimand identified
Randomised trial	Random assignment	ATE
Backdoor adjustment	No unmeasured confounders given Z	ATE, CATE
Propensity matching / IPW	Ignorability and positivity	ATE, ATT
Doubly robust (AIPW, TMLE)	One of outcome or propensity model correct	ATE, CATE
Instrumental variable	Exclusion, relevance, monotonicity	LATE
Regression discontinuity	Continuity at the cutoff	Local ATE at cutoff
Difference-in-differences	Parallel trends	ATT in treated period
Synthetic control	Pre-treatment fit, no spillover	Effect on the treated unit
Front-door adjustment	Mediator with blockable backdoor	ATE

Causal discovery

Causal discovery is the task of learning the causal graph itself from data, rather than estimating an effect given a graph. Several broad families exist.

Constraint-based methods test conditional independence statements in the data and find graphs consistent with the resulting set of constraints. The PC algorithm of Peter Spirtes and Clark Glymour, introduced in 1991 and treated comprehensively in Spirtes, Glymour, and Scheines' 2000 book "Causation, Prediction, and Search", is the canonical example.^[8]^[7] PC assumes causal sufficiency: that there are no latent common causes among the measured variables. The Fast Causal Inference (FCI) algorithm, due to Spirtes and colleagues in 1995, drops causal sufficiency and outputs a partial ancestral graph (PAG) whose edge marks express uncertainty introduced by possible latent confounders.^[9]

Score-based methods search the space of DAGs for one that maximises a model-fit score, usually a penalised likelihood such as the Bayesian Information Criterion. Greedy Equivalence Search (GES), due to David Maxwell Chickering in 2002, searches over equivalence classes rather than individual DAGs and is asymptotically consistent.^[10]

Continuous-optimisation methods reformulate DAG learning as a smooth optimisation problem. NOTEARS, by Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric Xing in 2018, encodes the acyclicity constraint as a smooth function of the weighted adjacency matrix, allowing standard gradient-based numerical solvers to be used in place of combinatorial search.^[12] Subsequent work has extended the idea to nonlinear and neural-network parametrisations.
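The smooth acyclicity function used by NOTEARS is $h(W) = \mathrm{tr}(e^{W \circ W}) - d$, where $\circ$ is the elementwise product and d is the number of variables; it equals zero exactly when the weighted graph has no directed cycle. The sketch below evaluates it with a truncated power series in plain Python. The helper names and the two example matrices are illustrative, and a real implementation would use a numerical library's matrix exponential.

```python
import math


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def acyclicity(w, terms=30):
    """h(W) = tr(exp(W * W)) - d, with * elementwise, via a truncated power series."""
    d = len(w)
    squared = [[v * v for v in row] for row in w]
    term = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity
    trace_of_exp = float(d)
    for k in range(1, terms + 1):
        # After this update, term holds squared^k / k!.
        term = [[v / k for v in row] for row in matmul(term, squared)]
        trace_of_exp += sum(term[i][i] for i in range(d))
    return trace_of_exp - d


chain = [[0.0, 1.5, 0.0],
         [0.0, 0.0, -2.0],
         [0.0, 0.0, 0.0]]          # edges 1 -> 2 -> 3: acyclic
loop = [[0.0, 1.0],
        [1.0, 0.0]]                # edges 1 -> 2 and 2 -> 1: cyclic

h_chain = acyclicity(chain)        # 0.0
h_loop = acyclicity(loop)          # positive
H_LOOP_CLOSED_FORM = 2.0 * math.cosh(1.0) - 2.0   # closed form for this 2 x 2 case
```

Because h is differentiable in W, the constraint h(W) = 0 can be handed to a continuous optimiser instead of being enforced by searching over graphs.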

Functional causal models exploit asymmetries in the data-generating process. LiNGAM, the Linear Non-Gaussian Acyclic Model of Shimizu, Hoyer, Hyvarinen, and Kerminen in 2006, identifies the full causal ordering of continuous variables under the assumption that the structural equations are linear and the noise terms are non-Gaussian, using independent component analysis.^[11]

How does causal inference intersect with modern machine learning?

For most of its history, causal inference treated the regression of Y on X (and similar steps) as nuisance pieces of a larger argument. Modern work has plugged in flexible machine-learning methods at exactly those points, while keeping the causal scaffolding intact.

Heterogeneous treatment effects

The most active area is estimation of CATE functions. Causal forests, introduced by Susan Athey and Stefan Wager in their 2018 Journal of the American Statistical Association paper "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests", adapt Breiman's random forest to estimate CATE pointwise with a Gaussian asymptotic distribution and valid confidence intervals.^[14]

Meta-learners are recipes for turning any supervised regression algorithm into a CATE estimator. Kunzel, Sekhon, Bickel, and Yu's 2019 PNAS paper "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning" formalised the S- and T-learners and introduced the X-learner; the R-learner of Nie and Wager is usually listed alongside them.^[15]^[16]

Meta-learner	Idea	Best when
S-learner	Single model with treatment as a feature	CATE is approximately zero or smooth
T-learner	Two separate models for treated and control	Response surfaces differ substantially
X-learner	Imputes unit-level effects in each group from the other group's outcome model, then regresses them on covariates	Treatment groups are imbalanced
R-learner	Robinson-style residual orthogonalisation (Nie and Wager 2021)^[16]	High-dimensional nuisance, want orthogonality

Bayesian causal forests, due to P. Richard Hahn, Jared Murray, and Carlos Carvalho in their 2020 Bayesian Analysis paper, build on Bayesian additive regression trees (BART), with priors designed to regularise the treatment effect separately from the prognostic part of the model, and have become a default in many policy applications.^[17]

Doubly robust learners and TMLE

Doubly robust estimators have been generalised to the CATE setting in the form of doubly robust learners (DRL) and double machine learning (DML, Chernozhukov and colleagues 2018).^[18] Targeted Minimum Loss-based Estimation (TMLE, also known as targeted maximum likelihood estimation), developed by Mark van der Laan and colleagues at Berkeley and consolidated in van der Laan and Rose's 2011 book "Targeted Learning", is a closely related class of methods that perform a final "targeting" step to remove plug-in bias while keeping the plug-in property of respecting outcome bounds.^[19]

Causal representation learning

A newer line of work asks where the causal variables themselves come from. In real perception problems, the raw observations are pixels or sensor readings, not abstract "smoking" and "cancer" nodes. Causal representation learning, defined by Scholkopf and colleagues as "the discovery of high-level causal variables from low-level observations," aims to extract latent variables from such observations that admit a causal model.^[20] Scholkopf, Locatello, Bauer, Ke, Kalchbrenner, Goyal, and Yoshua Bengio's 2021 Proceedings of the IEEE article "Toward Causal Representation Learning" is the standard reference and ties the agenda to issues of transfer, generalisation, and robustness in deep neural network systems.^[20]

The Causal Effect Variational Autoencoder (CEVAE), introduced by Christos Louizos and colleagues at NeurIPS 2017, treats unobserved confounders as latent variables in a variational autoencoder and recovers treatment effects from noisy proxies of the confounders.^[21]

Large language models and causality

The relationship between large language models and causal reasoning is an active and unsettled research area. LLMs can recite definitions of confounders and parse causal questions, but their ability to perform formal identification or to generalise to new causal structures is limited and the empirical results are mixed. The general lesson, repeated across the causal inference literature since Pearl, is that causal answers require causal assumptions, and a system trained purely to predict the next token has no obvious mechanism for surfacing those assumptions.^[2]

Causality in fairness

Many fairness criteria are counterfactual in nature. Counterfactual fairness, defined by Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva in their 2017 NeurIPS paper of the same name, requires that a decision be the same in the actual world and in a counterfactual world where the protected attribute is changed while everything not causally affected by it is held fixed.^[22] Path-specific effects refine this by allowing some causal pathways from the protected attribute to the outcome (such as legitimate qualifications) while forbidding others (such as direct discrimination). See algorithmic fairness, bias (ethics/fairness), counterfactual fairness, individual fairness, fairness constraint, fairness metric, fairness terms, incompatibility of fairness metrics, machine learning terms/fairness, unawareness (fairness through unawareness), and experimenter's bias for related entries.

Causality in reinforcement learning and policy evaluation

Reinforcement learning and causal inference share their core question: what is the value of acting differently? Off-policy evaluation (OPE) estimates the expected return of a target policy from data collected under a different behaviour policy, exactly the situation that causal inference was designed for. Importance-sampling estimators reweight observed trajectories by the ratio of target to behaviour policy probabilities, which is the sequential analogue of inverse probability weighting. Doubly robust and marginalised importance-sampling estimators carry over from the static causal inference literature and reduce the exponential variance of naive estimators in long horizons.
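The importance-sampling idea is easiest to see in a one-step (bandit) setting. The following sketch assumes two hypothetical policies over two actions and Bernoulli rewards; the probabilities and names are invented, and the estimator is given the true behaviour probabilities rather than estimating them.

```python
import random

BEHAVIOUR = {0: 0.7, 1: 0.3}     # action probabilities of the logging policy
TARGET = {0: 0.1, 1: 0.9}        # action probabilities of the policy to evaluate
REWARD_PROB = {0: 0.4, 1: 0.8}   # P(reward = 1 | action); not used by the estimator


def collect_logs(n, seed=0):
    """Simulate (action, reward) pairs gathered under the behaviour policy."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        action = 1 if rng.random() < BEHAVIOUR[1] else 0
        reward = 1.0 if rng.random() < REWARD_PROB[action] else 0.0
        logs.append((action, reward))
    return logs


def importance_sampling_value(logs):
    """Reweight each logged reward by target / behaviour probability of its action."""
    return sum(TARGET[a] / BEHAVIOUR[a] * r for a, r in logs) / len(logs)


TRUE_VALUE = sum(TARGET[a] * REWARD_PROB[a] for a in (0, 1))   # 0.1 * 0.4 + 0.9 * 0.8 = 0.76

logs = collect_logs(50000)
value_hat = importance_sampling_value(logs)       # close to TRUE_VALUE
logged_mean = sum(r for _, r in logs) / len(logs) # close to the behaviour policy's value, 0.52
```

The weights play the same role as inverse propensity weights. Over a long horizon they multiply across time steps, which is the source of the variance problem mentioned above.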

Software

Library	Language	Maintainer	Strength
DoWhy	Python	Microsoft Research (Sharma and Kiciman)	Unified four-step API; strong on assumption testing and refutation
EconML	Python	Microsoft ALICE	State-of-the-art CATE estimators (DML, doubly robust, deep IV)
CausalML	Python	Uber	Uplift modelling, meta-learners, neural-network estimators
causalgraphicalmodels	Python	(open source)	Pedagogical DAG manipulation and identifiability
grf	R	Athey, Tibshirani, Wager	Generalized random forests including causal forest
tmle	R	van der Laan group	Targeted maximum likelihood implementations
pcalg	R	Kalisch, Maathuis et al.	PC, FCI, GES and other discovery algorithms
Pyro CEVAE	Python	Uber AI Labs	Reference CEVAE implementation in a probabilistic programming language

DoWhy, EconML, and CausalML are commonly used together in production pipelines: DoWhy for stating and refuting the causal assumptions, EconML or CausalML for the heavy-machinery estimation step.^[33]

Practical pitfalls

Causal inference is unforgiving in a way that prediction is not. A model with bad predictions tells on itself in held-out error; a causal estimate built on the wrong assumptions can be confidently wrong with no internal warning sign.

Garbage in, garbage out. Causal claims are only as good as the assumptions they rest on, and most of those assumptions (no unmeasured confounders, positivity, SUTVA, parallel trends, exclusion restrictions) are untestable from the data alone. Sensitivity analyses, placebo tests, and cross-design replication exist precisely because point estimates without a credibility argument do not deserve to be believed. Athey and Imbens's 2017 Journal of Economic Perspectives review "The State of Applied Econometrics" is largely organised around this point.^[13]

Spurious correlations. A famous teaching example is Franz Messerli's 2012 New England Journal of Medicine note showing a strong national correlation between per-capita chocolate consumption and Nobel laureates per capita.^[32] The correlation is real; the causal story ("chocolate makes you smarter") is a joke that more than a few news outlets failed to get. The correlation is driven by national wealth, education systems, and many other common causes.

Simpson's paradox. Two variables can be positively associated overall, yet negatively associated in every subgroup, when there is a confounder. The 1973 University of California, Berkeley graduate admissions case is the canonical example: the headline acceptance rate looked biased against women, but within almost every department the rate was slightly higher for women; women applied to more competitive departments.^[34] Resolving the paradox requires deciding which causal model corresponds to the question being asked, and the resolution is different for different questions.
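The reversal is simple arithmetic and can be reproduced with a two-department table. The counts below are hypothetical and chosen only to make the pattern visible; they are not the Berkeley data.

```python
# Hypothetical counts: (applicants, admitted) per department and group.
ADMISSIONS = {
    "A": {"men": (800, 480), "women": (100, 65)},    # less competitive department
    "B": {"men": (200, 40), "women": (900, 225)},    # more competitive department
}


def overall_rate(group):
    """Admission rate pooled over departments."""
    applied = sum(d[group][0] for d in ADMISSIONS.values())
    admitted = sum(d[group][1] for d in ADMISSIONS.values())
    return admitted / applied


def department_rate(dept, group):
    """Admission rate within a single department."""
    applied, admitted = ADMISSIONS[dept][group]
    return admitted / applied


# Within departments: women 0.65 vs men 0.60 in A, women 0.25 vs men 0.20 in B.
# Pooled: women 0.29 vs men 0.52.
```

Women have the higher rate in each department and the lower rate overall, because most women in this table applied to the department that admits few applicants of either group.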

Berkson's paradox and collider bias. Conditioning on a common effect of two variables creates an association between them, even if they were originally independent. Hospital-based studies that condition implicitly on being sick enough to be admitted, dating-app data that conditions on appearing in a swipe pool, and academic studies that condition on having published produce non-causal correlations of exactly this kind.
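Collider bias can be shown by exact enumeration. This sketch assumes two independent binary traits, each present with probability 0.5, and a hypothetical selection rule that keeps a unit if it has at least one trait.

```python
from itertools import product

# Two independent binary traits; every combination has probability 0.25.
POPULATION = {(a, b): 0.25 for a, b in product((0, 1), repeat=2)}


def selected(a, b):
    """The collider: selection depends on both traits."""
    return a == 1 or b == 1


def p_a1_given_b(b, condition_on_selection):
    """P(A = 1 | B = b), optionally restricted to selected units."""
    keep = {
        (a, bb): p
        for (a, bb), p in POPULATION.items()
        if bb == b and (selected(a, bb) or not condition_on_selection)
    }
    return sum(p for (a, _), p in keep.items() if a == 1) / sum(keep.values())


# Whole population: P(A = 1 | B = 0) = P(A = 1 | B = 1) = 0.5, so A and B are independent.
# Selected units:   P(A = 1 | B = 0) = 1.0 but P(A = 1 | B = 1) = 0.5.
```

Among selected units, lacking one trait guarantees having the other, so the traits become negatively associated even though neither causes the other.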

Pearl's critique of pure deep learning. Pearl has repeatedly argued, including in "The Book of Why" and in subsequent writing, that systems that operate purely at the associational rung will not deliver causal answers no matter how much data they consume.^[2] The critique is not a denial of deep learning's value at rung 1; it is an argument that adding rungs 2 and 3 requires extra structure, and that pretending otherwise leads to confidently wrong policy advice. Whether and how this structure can be learned, rather than imposed, is one of the questions that animates causal representation learning today.^[20]

References

1. Pearl, J. (2009). *Causality: Models, Reasoning, and Inference*, 2nd ed. Cambridge University Press. https://bayes.cs.ucla.edu/BOOK-2K/
2. Pearl, J. and Mackenzie, D. (2018). *The Book of Why: The New Science of Cause and Effect*. Basic Books.
3. Rubin, D. B. (1974). "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies". *Journal of Educational Psychology* 66(5): 688-701.
4. Splawa-Neyman, J. (1923, translated 1990). "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9". *Statistical Science* 5(4): 465-472.
5. Imbens, G. W. and Rubin, D. B. (2015). *Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction*. Cambridge University Press. https://www.cambridge.org/core/books/causal-inference-for-statistics-social-and-biomedical-sciences/71126BE90C58F1A431FE9B2DD07938AB
6. Hernan, M. A. and Robins, J. M. (2024). *Causal Inference: What If*. Chapman & Hall / CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
7. Spirtes, P., Glymour, C., and Scheines, R. (2000). *Causation, Prediction, and Search*, 2nd ed. MIT Press.
8. Spirtes, P. and Glymour, C. (1991). "An Algorithm for Fast Recovery of Sparse Causal Graphs". *Social Science Computer Review* 9(1): 62-72.
9. Spirtes, P. (1995). "Directed Cyclic Graphical Representations of Feedback Models". *Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence*.
10. Chickering, D. M. (2002). "Optimal Structure Identification with Greedy Search". *Journal of Machine Learning Research* 3: 507-554.
11. Shimizu, S., Hoyer, P. O., Hyvarinen, A., and Kerminen, A. (2006). "A Linear Non-Gaussian Acyclic Model for Causal Discovery". *Journal of Machine Learning Research* 7: 2003-2030. https://www.jmlr.org/papers/volume7/shimizu06a/shimizu06a.pdf
12. Zheng, X., Aragam, B., Ravikumar, P., and Xing, E. P. (2018). "DAGs with NO TEARS: Continuous Optimization for Structure Learning". *Advances in Neural Information Processing Systems* 31. https://arxiv.org/abs/1803.01422
13. Athey, S. and Imbens, G. W. (2017). "The State of Applied Econometrics: Causality and Policy Evaluation". *Journal of Economic Perspectives* 31(2): 3-32. https://www.aeaweb.org/articles?id=10.1257/jep.31.2.3
14. Wager, S. and Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". *Journal of the American Statistical Association* 113(523): 1228-1242.
15. Kunzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning". *Proceedings of the National Academy of Sciences* 116(10): 4156-4165.
16. Nie, X. and Wager, S. (2021). "Quasi-Oracle Estimation of Heterogeneous Treatment Effects". *Biometrika* 108(2): 299-319.
17. Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020). "Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects". *Bayesian Analysis* 15(3): 965-1056.
18. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters". *The Econometrics Journal* 21(1): C1-C68.
19. van der Laan, M. J. and Rose, S. (2011). *Targeted Learning: Causal Inference for Observational and Experimental Data*. Springer.
20. Scholkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). "Toward Causal Representation Learning". *Proceedings of the IEEE* 109(5): 612-634. https://arxiv.org/abs/2102.11107
21. Louizos, C., Shalit, U., Mooij, J., Sontag, D., Zemel, R., and Welling, M. (2017). "Causal Effect Inference with Deep Latent-Variable Models". *Advances in Neural Information Processing Systems* 30. https://arxiv.org/abs/1705.08821
22. Kusner, M. J., Loftus, J. R., Russell, C., and Silva, R. (2017). "Counterfactual Fairness". *Advances in Neural Information Processing Systems* 30. https://arxiv.org/abs/1703.06856
23. Imbens, G. W. and Angrist, J. D. (1994). "Identification and Estimation of Local Average Treatment Effects". *Econometrica* 62(2): 467-475.
24. Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). "Identification of Causal Effects Using Instrumental Variables". *Journal of the American Statistical Association* 91(434): 444-455.
25. Thistlethwaite, D. L. and Campbell, D. T. (1960). "Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment". *Journal of Educational Psychology* 51(6): 309-317.
26. Card, D. and Krueger, A. B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania". *American Economic Review* 84(4): 772-793.
27. Abadie, A. and Gardeazabal, J. (2003). "The Economic Costs of Conflict: A Case Study of the Basque Country". *American Economic Review* 93(1): 113-132.
28. Abadie, A., Diamond, A., and Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program". *Journal of the American Statistical Association* 105(490): 493-505.
29. Huang, Y. and Valtorta, M. (2006). "Pearl's Calculus of Intervention is Complete". *Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence*. https://arxiv.org/abs/1206.6831
30. Shpitser, I. and Pearl, J. (2006). "Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models". *Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI)*.
31. Holland, P. W. (1986). "Statistics and Causal Inference". *Journal of the American Statistical Association* 81(396): 945-960.
32. Messerli, F. H. (2012). "Chocolate Consumption, Cognitive Function, and Nobel Laureates". *New England Journal of Medicine* 367(16): 1562-1564.
33. Sharma, A. and Kiciman, E. (2020). "DoWhy: An End-to-End Library for Causal Inference". https://arxiv.org/abs/2011.04216
34. Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975). "Sex Bias in Graduate Admissions: Data from Berkeley". *Science* 187(4175): 398-404.
35. Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier". *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. https://arxiv.org/abs/1602.04938
