Bayesian network
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 4,190 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 4,190 words
Add missing citations, update stale details, or suggest a clearer explanation.
A Bayesian network (also called a belief network, Bayes net, directed graphical model, or probabilistic causal network) is a probabilistic graphical model that represents a set of random variables and their conditional dependencies through a directed acyclic graph (DAG). Each node of the graph is a random variable, and each directed edge represents a direct probabilistic dependency, typically read as the parent influencing the child. The graph encodes a joint probability distribution as a product of local conditional distributions: P(X_1, ..., X_n) = ∏ P(X_i | parents(X_i)). This factorisation, together with the conditional-independence assumptions implied by the graph, lets a model with potentially exponential joint complexity be specified, stored, and queried in space and time that scale with the local structure rather than the full state space.[1][2]
The modern formulation of Bayesian networks was developed in the 1980s by Judea Pearl and collaborators at UCLA, who unified prior work on probabilistic expert systems, influence diagrams, and tree-based propagation into the framework laid out in Pearl's 1988 textbook Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.[1] Together with the junction-tree algorithm of Lauritzen and Spiegelhalter (1988),[3] the Bayesian-network formalism became the standard tool for reasoning about uncertainty in artificial intelligence throughout the 1990s. It later provided the substrate on which Pearl built his theory of causal inference and the do-calculus, presented in Causality: Models, Reasoning, and Inference (2nd ed., 2009).[4] The canonical reference for the broader field of probabilistic graphical models is Koller and Friedman's 2009 textbook Probabilistic Graphical Models: Principles and Techniques.[2]
Bayesian networks are closely related to, and in many cases generalise, classical models such as naive Bayes classifiers, Hidden Markov Models, and Kalman filters. They remain in active use for medical diagnosis, fault diagnosis, gene regulatory network inference, risk assessment, decision support, and any setting where structured uncertainty, expert knowledge, and interpretability matter more than raw predictive accuracy.
A Bayesian network over variables X_1, ..., X_n consists of two parts. The first is a directed acyclic graph G whose nodes are the variables. The second is, for each node X_i, a conditional probability distribution P(X_i | parents(X_i)) that quantifies how the variable depends on its parents in G. For discrete variables, this distribution is usually represented as a conditional probability table (CPT). For continuous variables, it is a conditional density, often a linear-Gaussian or a noisy-OR model.[1][2]
The central claim of the formalism is that the joint distribution factorises as
P(X_1, ..., X_n) = ∏_{i=1..n} P(X_i | parents(X_i))
This factorisation is a direct consequence of the chain rule of probability combined with the local Markov assumption: each variable is conditionally independent of its non-descendants given its parents. The factorisation has two practical consequences. First, the model is parameter-frugal: a node with k binary parents needs 2^k probabilities, not 2^n. Second, queries that respect the graph structure can often be answered by message-passing on the graph rather than by enumerating the full joint table.[2]
The roots of Bayesian networks lie in 1960s and 1970s work on probabilistic reasoning in expert systems, which struggled with the combinatorial explosion of joint distributions over many variables. Several lines of research converged in the early 1980s. In statistics, contingency-table analysis and the theory of graphical log-linear models, developed by Darroch, Lauritzen, and Speed, established that conditional independence could be read off from a graph. In artificial intelligence, the limitations of certainty factors and ad-hoc rule-based reasoning under uncertainty were becoming apparent, and Pearl, Spiegelhalter, Cooper, Henrion, and others were searching for a coherent probabilistic alternative.[1]
Judea Pearl's 1986 paper introduced belief propagation as an exact inference algorithm for poly-trees, and his 1988 textbook Probabilistic Reasoning in Intelligent Systems consolidated the field, defining d-separation, message passing, and the semantics of belief networks.[1] In parallel, Lauritzen and Spiegelhalter's 1988 paper in the Journal of the Royal Statistical Society introduced the junction-tree (also called clique-tree or join-tree) algorithm, which extended exact inference to general DAGs by triangulating the moralised graph.[3] The 1990s saw a rapid expansion: Cooper proved in 1990 that exact inference is NP-hard,[5] which motivated the development of approximate inference methods including Markov chain Monte Carlo, variational approximations, and loopy belief propagation. Software tools such as Hugin (a commercialisation of the Lauritzen-Spiegelhalter framework), Netica, and Pearl's own Bayes Net Toolbox emerged in this period.
In the 2000s the focus shifted toward learning. Heckerman's 1995 Microsoft Research tutorial gave a Bayesian treatment of parameter and structure learning that influenced a generation of researchers,[6] and the Friedman, Geiger, and Goldszmidt 1997 paper introduced Tree-Augmented Naive Bayes (TAN) as a competitive classifier.[7] Pearl's 2000/2009 Causality recast Bayesian networks as a language for causal reasoning, introducing the do-operator and the calculus of interventions. The 2010s and 2020s have seen Bayesian networks subsumed into the broader frameworks of probabilistic programming (Stan, PyMC, Pyro, NumPyro), causal discovery, and neuro-symbolic AI.[2][4]
For each node, a Bayesian network specifies a conditional distribution given its parents. With discrete variables this is a CPT whose rows are parent configurations and whose columns are the possible values of the child. For continuous variables, common parametric forms include linear-Gaussian (the child is a linear function of its parents plus Gaussian noise) and conditional-Gaussian for hybrid networks that mix discrete and continuous variables.[2]
D-separation is a graph-theoretic criterion, introduced by Pearl, that determines which conditional independencies hold in a Bayesian network purely from its graph structure. A path between two nodes is blocked by a conditioning set Z if it contains a chain or fork node that is in Z, or a collider (a node with two converging arrows) that is neither in Z nor has a descendant in Z. If every path between X and Y is blocked by Z, then X and Y are d-separated by Z, and the network entails X ⊥ Y | Z.[1] D-separation is sound (every d-separated pair is conditionally independent in every distribution that factorises over the graph) and is a complete graphical characterisation of the independencies entailed by the structure alone.
The Markov blanket of a node X consists of its parents, its children, and the other parents of those children (sometimes called spouses or co-parents). Conditional on its Markov blanket, X is independent of every other node in the network. This locality property is what makes Gibbs sampling and other local update schemes practical: each variable can be resampled using only the values of its blanket.[1][2]
The joint factorisation P(X_1, ..., X_n) = ∏ P(X_i | parents(X_i)) follows from the chain rule of probability under any topological ordering, combined with the local Markov property that says a node is independent of its non-descendants given its parents. Choosing a different ordering can yield a different graph, and minimal Bayesian networks (those with the fewest edges that respect a given independence structure) generally depend on ordering, although they are unique up to a class of equivalent DAGs sharing the same skeleton and v-structures.
Given a Bayesian network, the canonical inference problem is to compute the posterior distribution of one set of variables (queries) given values for another set (evidence), marginalising over the rest. Several families of algorithms exist; the choice depends on graph structure, the number of variables, and accuracy requirements.
| Method | Type | Origin | Notes |
|---|---|---|---|
| Belief propagation | Exact on trees | Pearl 1986; Pearl 1988 | Exact for poly-trees; "loopy BP" is an approximation on general graphs |
| Variable elimination | Exact | Zhang and Poole 1994; Dechter 1996 | Sums out non-query variables one at a time; complexity exponential in induced width |
| Junction tree (clique tree) | Exact | Lauritzen and Spiegelhalter 1988[3] | Polynomial in the tree-width of the moralised, triangulated graph |
| Cutset conditioning | Exact | Pearl 1988 | Conditions on a small set to break loops, then runs poly-tree BP on each instantiation |
| Gibbs sampling | Approximate, sampling | Geman and Geman 1984; Pearl 1987 | Uses Markov-blanket conditionals; basis of WinBUGS |
| Metropolis-Hastings | Approximate, sampling | Metropolis et al. 1953; Hastings 1970 | General-purpose MCMC |
| Importance sampling, likelihood weighting | Approximate, sampling | Fung and Chang 1989; Shachter and Peot 1989 | Useful when evidence is unlikely under the prior |
| Mean-field variational inference | Approximate, deterministic | Saul, Jaakkola, Jordan 1996 | Posits a factorised approximation; minimises KL divergence |
| Structured variational methods | Approximate, deterministic | Saul and Jordan 1996; Wainwright and Jordan 2008 | Trades accuracy for tractability through structured families |
| Loopy belief propagation | Approximate | Murphy, Weiss, Jordan 1999 | Belief propagation applied to general graphs; often empirically accurate |
Gregory Cooper proved in 1990 that exact inference in general Bayesian networks is NP-hard, even for restricted topologies.[5] Dagum and Luby (1993) later showed that approximating posterior probabilities to within a constant factor is also NP-hard in the worst case. The intractability is not merely theoretical: real medical-diagnosis networks such as QMR-DT (Quick Medical Reference, Decision-Theoretic) have hundreds of disease nodes and thousands of finding nodes, making exact inference infeasible and motivating the use of variational and sampling approximations.
For structured graphs the picture is much better. Trees admit linear-time exact inference through belief propagation. Graphs with bounded tree-width admit polynomial-time exact inference through the junction-tree algorithm, with complexity O(n · 2^w) for binary variables, where w is the tree-width.[3] In practice this means that networks with many variables but limited induced connectivity remain tractable.
A full learning problem combines parameter learning (filling in the CPTs) with structure learning (choosing the DAG itself). When the structure is known and data is complete, parameter learning reduces to counting; otherwise it can become very hard.
With complete data and a fixed structure, maximum-likelihood estimation of CPT entries reduces to normalised counts of joint occurrences in the data. The Bayesian alternative places Dirichlet priors on the rows of each CPT and yields posterior CPT entries as posterior expectations of the corresponding multinomial parameters; the conjugate-prior structure makes this update a closed-form add-and-normalise operation. With incomplete data or hidden variables, the standard tool is the expectation-maximisation (EM) algorithm, applied in this setting by Lauritzen (1995) and others. Structural EM (Friedman 1997) extends EM to alternate between parameter and structure updates when there are latent variables.[6]
Learning the DAG itself is harder. Chickering (1996) showed that finding the network structure that maximises a Bayesian score is NP-hard. Three main families of algorithms exist.
| Family | Representative algorithms | Idea |
|---|---|---|
| Score-based | Greedy hill-climbing, tabu search, simulated annealing, GES (Greedy Equivalence Search) | Optimise a goodness-of-fit score (BIC, AIC, BDe, MDL) over the space of DAGs or equivalence classes |
| Constraint-based | PC algorithm (Spirtes and Glymour 1991); FCI (Fast Causal Inference) | Use conditional-independence tests to discover the skeleton, then orient edges using v-structures and orientation rules[8] |
| Hybrid | MMHC (Tsamardinos, Brown, Aliferis 2006)[9]; Sparse Candidate (Friedman et al. 1999) | Use independence tests to restrict candidate parents, then score-optimise within that restricted space |
Score-based methods rely on decomposable scores such as the Bayesian Dirichlet equivalent (BDe) score introduced by Heckerman, Geiger, and Chickering (1995),[6] BIC (Bayesian information criterion), or AIC (Akaike information criterion). Bayesian model averaging integrates predictions over the posterior distribution on graphs rather than picking a single best graph, which can be useful when many structures fit the data nearly equally well.
Constraint-based methods like the PC algorithm[8] start from a complete undirected graph and remove edges between variables found to be conditionally independent given some subset of the others. Surviving edges are then oriented using v-structures (colliders) and Meek's rules. The Fast Causal Inference (FCI) algorithm extends PC to handle latent confounders and selection bias, returning a partial ancestral graph rather than a single DAG.
The MMHC algorithm[9] combines a constraint-based local skeleton-discovery phase (Max-Min Parents and Children) with a hill-climbing search over DAGs restricted to that skeleton. Empirical comparisons in Tsamardinos, Brown, and Aliferis (2006) found it to outperform PC, GES, and several other algorithms across many benchmarks.
The simplest Bayesian network of practical interest is the naive Bayes classifier, in which a single class node C is the parent of every feature node X_i, and the features are assumed conditionally independent given the class. Despite its strong independence assumption, naive Bayes is often competitive with much more sophisticated classifiers, especially for text classification.[7]
Friedman, Geiger, and Goldszmidt (1997) introduced TAN as a relaxation of naive Bayes that allows each feature to depend on at most one other feature in addition to the class.[7] The optimal tree-shaped augmentation can be found in polynomial time using a maximum-spanning-tree algorithm on the conditional mutual information between features. TAN typically improves classification accuracy over naive Bayes while preserving most of its computational simplicity.
Dynamic Bayesian networks (DBNs) extend the framework to time-series data. Introduced by Dean and Kanazawa in their 1989 paper A Model for Reasoning about Persistence and Causation in Computational Intelligence,[10] a DBN consists of a prior network B_1 over the variables at time 1 and a transition network B_→ that defines the conditional distribution of the variables at time t+1 given those at time t. Standard inference tasks in DBNs include filtering (computing the posterior over the current state given past evidence), smoothing (over a past state given all evidence), prediction (over a future state), and most-likely-explanation. Hidden Markov Models and Kalman filters are special cases of DBNs.
Influence diagrams extend Bayesian networks with two additional node types: decision nodes representing controllable choices and utility nodes representing the agent's payoff. The resulting model supports decision-theoretic reasoning under uncertainty and is the basis of normative decision-support systems in medicine, business, and engineering.
Markov random fields (MRFs) are the undirected counterpart of Bayesian networks. They factorise the joint distribution as a product of non-negative potential functions over cliques rather than as a product of conditional distributions over node-parent pairs. Factor graphs provide a unifying bipartite representation in which both Bayesian networks and MRFs are special cases, and the sum-product and max-product algorithms operate on factor graphs.[2]
A causal Bayesian network is a Bayesian network whose edges are interpreted as direct causal relationships rather than mere statistical associations. Pearl developed the framework in the 1990s and consolidated it in Causality: Models, Reasoning, and Inference (2nd ed., 2009).[4]
The key new operation is the do-operator. P(Y | do(X = x)) denotes the distribution of Y after an external intervention sets X to x, which is generally different from the conditional distribution P(Y | X = x). Graphically, do(X = x) corresponds to a mutilated network in which all incoming edges to X are removed and X is set to x. The do-calculus, a set of three rules introduced by Pearl, gives sufficient conditions for rewriting interventional queries in terms of observational ones. Shpitser and Pearl (2006) and Tian and Pearl (2002) gave a complete identifiability algorithm: an interventional distribution is identifiable from observational data and a causal DAG if and only if their algorithm returns a do-free expression.
Causal Bayesian networks underpin much of modern causal inference and are connected to Bayesian-network structure learning through algorithms such as PC and FCI, which try to recover the causal graph (or its Markov equivalence class) from data alone under faithfulness assumptions.[8]
Bayesian networks are most useful in domains that involve structured uncertainty, expert knowledge that needs to be encoded explicitly, and a need for interpretability. The following table lists historically and currently important applications.
| Application | System / domain | Notes |
|---|---|---|
| Medical diagnosis | MUNIN (Andreassen et al., 1989) | Causal probabilistic network for interpretation of electromyographic findings; one of the first large medical Bayesian networks |
| Medical diagnosis | Pathfinder (Heckerman, 1992) | Diagnosis of lymph-node diseases; over 100 features and 60+ diseases; commercialised as Intellipath |
| Medical diagnosis | QMR-DT | Decision-theoretic version of Quick Medical Reference; bipartite network with hundreds of diseases and thousands of findings |
| User modelling | Lumiere (Microsoft Research, 1993-1998) | Bayesian user-modelling system that became the basis for the Office Assistant ("Clippy") in Microsoft Office 97 |
| Email filtering | Bayesian spam filters (e.g., SpamAssassin, Mozilla) | Naive-Bayes classifiers over word-level features; widely deployed before transformer-based filters |
| Defence and intelligence | Pearl Harbor reconstruction, missile-defence systems | Bayesian fusion of uncertain sensor data |
| Bioinformatics | Gene regulatory network inference | Dynamic Bayesian networks applied to time-series gene-expression data; models genetic regulatory networks |
| Risk and reliability | Industrial fault diagnosis, nuclear safety | Encodes failure modes and diagnostic findings in interpretable form |
| Speech recognition | HMM-based ASR (pre-deep-learning) | Hidden Markov Models, special-case DBNs, dominated speech recognition through the 1990s and 2000s |
| Tracking | Kalman-filter-based vehicle and pedestrian tracking | Linear-Gaussian DBNs applied to sensor-fusion problems |
| Computer vision | Pre-deep-learning image segmentation, object recognition | Bayesian networks and MRFs were used for stereo, segmentation, and parts-based recognition |
| Forensics | DNA mixture interpretation | Networks like those in TrueAllele encode the relationship between genotypes, observed allele peaks, and evidence |
| Tool | Language / platform | Notes |
|---|---|---|
| Bayes Net Toolbox (BNT) | MATLAB | Written by Kevin Murphy from 1997 onwards; long-standing reference implementation |
| Hugin | Commercial; APIs in many languages | Direct descendant of the Lauritzen-Spiegelhalter junction-tree work |
| Netica | Commercial (Norsys) | Widely used in industry and education |
| GeNIe / SMILE | Free GUI / commercial library; from BayesFusion (formerly U. Pittsburgh) | Influence diagrams and Bayesian networks |
| SamIam | Java GUI; UCLA Automated Reasoning Group | Research-oriented, exact inference and sensitivity analysis |
| pgmpy | Python | Open-source library for discrete and continuous Bayesian networks |
| bnlearn | R | Implements PC, GS, Hill-Climbing, MMHC, and other learners |
| daft | Python | A drawing library for plate-notation graphical models |
| Stan, PyMC, Pyro, NumPyro | Probabilistic programming languages | Generalise Bayesian networks; support continuous variables, MCMC, and variational inference |
| Model | Directed? | Factorisation | Inference complexity | Typical use |
|---|---|---|---|---|
| Bayesian network | Yes (DAG) | Product of conditionals P(X_i | parents) | Polynomial in tree-width; NP-hard in general[5] | Diagnosis, causal reasoning, expert systems |
| Markov random field (MRF) | No | Product of clique potentials | Polynomial in tree-width; NP-hard in general | Image segmentation, statistical physics |
| Factor graph | Bipartite | Product of factors over variable subsets | Same as underlying graph | Generic message-passing, including LDPC codes |
| Hidden Markov Model (HMM) | Yes (chain DAG) | P(X_t | X_{t-1}) P(O_t | X_t) | Linear in chain length (forward-backward) | Speech recognition, bioinformatics |
| Dynamic Bayesian network | Yes | Time-extended Bayesian network[10] | Polynomial in tree-width per slice | Tracking, time-series modelling |
| Conditional random field (CRF) | Undirected | Conditional log-linear model P(Y | X) | Polynomial in tree-width | Sequence labelling (POS tagging, NER) |
| Influence diagram | Mixed (DAG + decision and utility nodes) | Bayesian network + decision and utility | Polynomial in tree-width for fixed strategies | Decision support |
A recurring source of confusion is the relationship between Bayesian networks and Bayesian inference. The two ideas are related but distinct. Bayesian inference is a general philosophy of statistical reasoning in which beliefs are represented as probability distributions and updated via Bayes' theorem; it can be applied to any model, Bayesian network or not. A Bayesian network is a particular representation for a joint probability distribution, named after Bayes' theorem rather than committed to Bayesian-versus-frequentist methodology. CPTs in a Bayesian network can be estimated by maximum likelihood (a frequentist procedure), and inference of the posterior over query variables given evidence can be performed by purely probabilistic message-passing without invoking any prior. Conversely, Bayesian methods can place priors over the CPT entries, the structure, or both, and combine them through the same network machinery; this is the route that Heckerman, Geiger, and Chickering took with the BDe score.[6]
Through the 2010s, deep neural networks came to dominate most large-scale perception and language tasks: image classification, speech recognition, machine translation, and language modelling. Bayesian networks were ill-suited to those problems for several reasons. They scale badly with the number of variables once the graph is dense. They typically require expert knowledge or careful structure learning, both of which are awkward for raw pixels or characters. Their generative semantics, while a strength for diagnosis, is a weakness for discriminative tasks where end-to-end optimisation of conditional likelihood is what matters.
At the same time, Bayesian networks did not disappear. They remain the preferred tool for problems with the following characteristics:
Probabilistic programming languages (Stan, PyMC, Pyro, NumPyro) generalise Bayesian networks to arbitrary stochastic computations and provide automatic inference engines based on Hamiltonian Monte Carlo and variational inference. They are best understood as the modern descendants of the BNT and Hugin tradition rather than as replacements for graphical models.
Several contemporary research currents have brought Bayesian networks back into the spotlight, often under different names. Causal discovery, building on Spirtes-Glymour, Pearl, and others, has become a substantial subfield with applications in genomics, economics, and reinforcement learning. Probabilistic programming systems treat Bayesian networks as a special case of generative models written in code. Neuro-symbolic AI combines neural networks with discrete graphical models, often using neural networks to parameterise the conditional distributions of a Bayesian network or to amortise variational posteriors. And explainable-AI work in high-stakes domains continues to value the transparent, decomposable structure of Bayesian networks over the opaque computations of large neural models.