Bayes' theorem
Last reviewed
Apr 30, 2026
Sources
32 citations
Review status
Source-backed
Revision
v2 · 4,367 words
Bayes' theorem (also called Bayes' rule or Bayes' law) is a fundamental theorem of probability theory that describes how to update the probability of a hypothesis given new evidence. It expresses the posterior probability of a proposition as proportional to the product of its prior probability and the likelihood of the observed data. In its simplest form the theorem states that P(A | B) = P(B | A) * P(A) / P(B), where P(B) > 0.[1][2]
Named after the Reverend Thomas Bayes (1701-1761), an English Presbyterian minister and amateur mathematician, the theorem appeared only after his death. His friend Richard Price edited Bayes's manuscript and submitted it to the Royal Society of London, where it was read in December 1763 and printed the following year.[3][4] The theorem was independently rediscovered and substantially generalized by the French mathematician Pierre-Simon Laplace in 1774 and developed further in his 1812 treatise Théorie analytique des probabilités.[5]
Despite its simple algebraic form, Bayes' theorem provides the foundation for an entire school of statistical reasoning known as Bayesian statistics or Bayesian inference. It underlies medical diagnosis, spam classification, search and rescue operations, scientific model comparison, and large parts of contemporary machine learning. Algorithms from naive Bayes classifiers and Bayesian networks to Markov chain Monte Carlo samplers and variational inference all flow from the rule first written down in Bayes's 1763 essay.[6][7]
Let A and B be two events with P(B) > 0. Bayes' theorem states P(A | B) = P(B | A) * P(A) / P(B). The equation expresses the probability of A after observing B (the posterior) in terms of the prior probability of A, the likelihood of observing B under A, and the total probability of B. Each piece plays a distinct role in Bayesian reasoning.
| Symbol | Name | Interpretation |
|---|---|---|
| `P(A \| B)` | Posterior probability | Probability of A after observing the evidence B |
| `P(B \| A)` | Likelihood | Probability of observing B if A is true |
| `P(A)` | Prior probability | Probability of A before observing the evidence |
| `P(B)` | Marginal likelihood | Total probability of B over all hypotheses; the normalizing constant |
In practice P(B) is often hard to compute directly, and the theorem is used in its proportional form P(A | B) ∝ P(B | A) * P(A). The constant is recovered by requiring posterior probabilities to sum (or integrate) to one over all hypotheses.[6]
Bayes' theorem follows directly from the definition of conditional probability. For events A and B with P(A) > 0 and P(B) > 0:
P(A | B) = P(A ∩ B) / P(B) and P(B | A) = P(A ∩ B) / P(A)
Multiplying the second equation by P(A) gives P(A ∩ B) = P(B | A) * P(A). Substituting into the first equation yields P(A | B) = P(B | A) * P(A) / P(B). The derivation is symmetric in A and B: Bayes' theorem is really a statement about how to invert a conditional probability. Given a forward model that tells us how likely the evidence is under each hypothesis, the theorem flips the conditioning to reason from evidence back to hypothesis.[1][2]
Bayes' theorem can be extended in several directions useful in applications.
When the hypothesis space is partitioned into mutually exclusive and exhaustive events {A_1, A_2, ..., A_n}, the marginal likelihood expands by the law of total probability:
P(B) = sum_i P(B | A_i) * P(A_i)
Substituting gives the form used in nearly every elementary probability course:
P(A_k | B) = P(B | A_k) * P(A_k) / sum_i P(B | A_i) * P(A_i)
This is the version Laplace wrote down in his 1774 memoir.[5][8]
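The partition form translates almost line for line into code. The following is a minimal sketch, not a reference implementation: the function name `posterior` and the three-machine defect example are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_k | B) for every hypothesis A_k in a finite partition.

    priors[k] is P(A_k); likelihoods[k] is P(B | A_k).
    """
    if set(priors) != set(likelihoods):
        raise ValueError("priors and likelihoods must cover the same hypotheses")
    # Numerator of Bayes' theorem for each hypothesis.
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    # Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i).
    evidence = sum(joint.values())
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return {k: v / evidence for k, v in joint.items()}


# Hypothetical example: three machines with different output shares
# (priors) and defect rates (likelihoods); B is "the item is defective".
priors = {"m1": 0.5, "m2": 0.3, "m3": 0.2}
likelihoods = {"m1": 0.01, "m2": 0.02, "m3": 0.05}
post = posterior(priors, likelihoods)
print(post)
```

In this made-up example the machine with the smallest share of output ends up with the largest posterior, because its defect rate is high enough to outweigh its low prior.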
For continuous random variables with parameter θ and observed data x, the discrete sum is replaced by an integral and probabilities by densities:
f(θ | x) = f(x | θ) * f(θ) / ∫ f(x | θ) * f(θ) dθ
This form is the workhorse of modern Bayesian statistics. Here f(x | θ) is the likelihood, f(θ) is the prior density, and the denominator is the marginal likelihood. Computing the integral is often intractable for high-dimensional θ, motivating the use of MCMC and variational inference methods.[6][7]
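For a one-dimensional parameter the normalizing integral can be approximated on a grid, which makes the continuous form concrete before any sampling machinery is needed. The sketch below assumes a coin with unknown bias θ, a uniform prior on [0, 1], and hypothetical data of 7 heads and 3 tails; the function name and grid size are illustrative choices.

```python
def grid_posterior(heads, tails, n_points=1001):
    """Approximate f(theta | x) on an evenly spaced grid over [0, 1]."""
    step = 1.0 / (n_points - 1)
    thetas = [i * step for i in range(n_points)]
    prior = [1.0] * n_points  # uniform prior density f(theta) = 1
    # Binomial likelihood up to a constant; the binomial coefficient
    # cancels between numerator and denominator.
    likelihood = [t ** heads * (1.0 - t) ** tails for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    # Trapezoidal rule for the integral of f(x | theta) * f(theta) d(theta).
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    density = [u / evidence for u in unnormalized]
    return thetas, density


thetas, density = grid_posterior(heads=7, tails=3)
step = thetas[1] - thetas[0]
posterior_mean = step * sum(t * d for t, d in zip(thetas, density))
print(round(posterior_mean, 4))
```

Grid methods of this kind scale poorly: the number of grid points grows exponentially with the dimension of θ, which is one way to see why high-dimensional problems call for other techniques.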
Dividing the posterior probability of a hypothesis by the posterior of its negation eliminates the marginal likelihood:
P(A | B) / P(¬A | B) = ( P(B | A) / P(B | ¬A) ) * ( P(A) / P(¬A) )
In words, posterior odds equal the likelihood ratio multiplied by prior odds. The ratio P(B | A) / P(B | ¬A) is the Bayes factor for A against ¬A, and it summarizes how strongly the evidence favors A. Bayes factors are widely used in model comparison because they can be reported without committing to prior probabilities for the competing models, although for models with free parameters they still depend on the priors placed on those parameters.[9]
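The odds form reduces an update to a single multiplication. A small sketch, with hypothetical function names and made-up numbers:

```python
def posterior_odds(prior_prob, p_b_given_a, p_b_given_not_a):
    """Posterior odds of A against not-A after observing B."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    bayes_factor = p_b_given_a / p_b_given_not_a
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Convert odds o = p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)


# Hypothetical numbers: prior P(A) = 0.2, and the evidence is four
# times more likely under A than under not-A.
odds = posterior_odds(prior_prob=0.2, p_b_given_a=0.8, p_b_given_not_a=0.2)
prob = odds_to_probability(odds)
print(odds, prob)
```

Prior odds of 1 to 4 multiplied by a Bayes factor of 4 give even posterior odds, that is, a posterior probability of about one half.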
Thomas Bayes was born in London, probably in 1701, though the date is uncertain and some sources give 1702. He was the son of Joshua Bayes, one of the first six Nonconformist ministers publicly ordained in England after the Act of Toleration. Thomas Bayes became a Presbyterian minister and from around 1733 served the congregation at Tunbridge Wells in Kent. He was elected a Fellow of the Royal Society in 1742 for mathematical work that included a defence of Newton's calculus against Bishop Berkeley.[3][10]
Bayes did not publish his most famous result during his lifetime. After his death on 7 April 1761, his friend Richard Price (1723-1791) received Bayes's mathematical papers, among them an essay titled "An Essay towards solving a Problem in the Doctrine of Chances." Price edited the manuscript, added an introduction and appendix, and submitted it to the Royal Society. It was read on 23 December 1763 and published in 1764 in Philosophical Transactions of the Royal Society of London, volume 53, pages 370-418.[1][3] Bayes framed the problem geometrically, using a thought experiment involving balls thrown onto a square table, and solved a special case with a uniform prior on the unit interval. The general formulation now called Bayes' theorem was extracted from his work by later mathematicians, especially Laplace.[4][11]
The French mathematician Pierre-Simon Laplace (1749-1827) independently arrived at the principle of inverse probability in his 1774 memoir "Mémoire sur la probabilité des causes par les événemens," without knowing of Bayes's earlier essay. He stated what we would now call Bayes' theorem in essentially its modern general form, applied it to problems including the estimation of the masses of Saturn and Jupiter from astronomical observations, and used it as the core engine of his probability theory.[5][8] Laplace's monumental Théorie analytique des probabilités, first published in 1812, developed the Bayesian approach so thoroughly that for most of the nineteenth century the method was known as the "probability of causes" or "inverse probability" and was attributed primarily to Laplace. The modern name "Bayes' theorem" became standard only in the early twentieth century.[5][12]
In the early twentieth century, Bayesian methods fell out of favor. Ronald A. Fisher, Jerzy Neyman, and Egon Pearson developed an alternative framework based on long-run frequencies of repeated experiments, including significance tests, confidence intervals, and hypothesis testing. Fisher was sharply critical of subjective priors. The frequentist school dominated mainstream statistics from the 1920s through the 1950s.[12][13]
A Bayesian revival began in the 1950s, driven by mathematicians and philosophers who argued that probability could legitimately represent degrees of belief. Key figures included Harold Jeffreys, whose 1939 Theory of Probability gave a comprehensive Bayesian treatment with objective priors; Bruno de Finetti, who developed a subjective theory and proved his representation theorem for exchangeable sequences; Leonard J. Savage, whose 1954 The Foundations of Statistics axiomatized subjective probability; and I. J. Good, who applied Bayesian methods to cryptanalysis at Bletchley Park.[12][14]
A practical revival came in the late 1980s and early 1990s with cheap computing and the rediscovery of Markov chain Monte Carlo methods. The Metropolis algorithm dated from 1953 and Hastings's generalization from 1970, but it was the 1990 paper by Alan Gelfand and Adrian Smith on Gibbs sampling that brought MCMC into mainstream statistics. Integrals that had previously made Bayesian analysis intractable could now be approximated numerically, and Bayesian methods spread rapidly through statistics, machine learning, and the empirical sciences.[12][15]
The classical illustration is a diagnostic test. Suppose a disease has 1 percent prevalence, so the prior P(D) = 0.01. A blood test has sensitivity (true positive rate) P(+ | D) = 0.99 and specificity (true negative rate) P(- | ¬D) = 0.99, implying a false positive rate P(+ | ¬D) = 0.01. A patient tests positive. What is the probability they actually have the disease?
Intuition often suggests an answer near 99 percent, but Bayes' theorem gives a very different result. Apply the theorem with the law of total probability:
P(D | +) = P(+ | D) * P(D) / ( P(+ | D) * P(D) + P(+ | ¬D) * P(¬D) )
Substituting the numerical values:
P(D | +) = (0.99 * 0.01) / ( 0.99 * 0.01 + 0.01 * 0.99 ) = 0.0099 / 0.0198 = 0.5
A positive test result raises the probability of disease only to 50 percent. With 1 percent prevalence the population contains far more healthy people than sick people, so even a 1 percent false positive rate produces as many false positives as the test produces true positives. This is the textbook illustration of the base rate fallacy and is why a positive screening result usually triggers a confirmatory test rather than an immediate diagnosis.[16][17] If the same test is applied in a high-risk population where prevalence is 50 percent, the calculation gives P(D | +) = 0.99. The diagnostic value of a test depends on the prior probability that the patient has the disease, not only on its sensitivity and specificity.
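The dependence on prevalence is easy to tabulate. The sketch below recomputes the two cases from the text and adds a 10 percent prevalence as an extra, purely illustrative data point; the function name is a hypothetical choice.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D | +) from the prior P(D) and the test's error rates."""
    true_pos = sensitivity * prevalence                    # P(+ | D) * P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | not D) * P(not D)
    return true_pos / (true_pos + false_pos)


# 1 percent and 50 percent come from the text; 10 percent is added
# here only to show the trend.
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"prevalence={prevalence:.2f}  P(D | +)={ppv:.3f}")
```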
A classic introductory exercise: Jar 1 contains 30 vanilla and 10 chocolate cookies; Jar 2 contains 20 vanilla and 20 chocolate cookies. A jar is chosen uniformly at random and a cookie is drawn. The cookie is vanilla. What is the probability it came from Jar 1?
With priors P(J1) = P(J2) = 0.5 and likelihoods P(V | J1) = 0.75, P(V | J2) = 0.5, Bayes' theorem gives P(J1 | V) = (0.75 * 0.5) / (0.75 * 0.5 + 0.5 * 0.5) = 0.375 / 0.625 = 0.6. The vanilla cookie raises the probability the jar was Jar 1 from 0.5 to 0.6.[2][18]
In the Monty Hall problem, a contestant on a game show is shown three doors. Behind one door is a car and behind the other two are goats. The contestant picks a door, after which the host, who knows where the car is, opens one of the remaining doors to reveal a goat and offers the contestant the chance to switch. Should they switch?
Let C_i denote the event that the car is behind door i, and let H_3 denote the event that the host opens door 3. Suppose the contestant picks door 1 and the host opens door 3. By Bayes' theorem P(C_2 | H_3) = P(H_3 | C_2) * P(C_2) / P(H_3). If the car is behind door 2 the host must open door 3, so P(H_3 | C_2) = 1. If the car is behind door 1 the host chooses door 2 or 3 uniformly, so P(H_3 | C_1) = 0.5. If the car is behind door 3 the host cannot open it, so P(H_3 | C_3) = 0. The priors are all 1/3. By the law of total probability the marginal is P(H_3) = 0.5 * 1/3 + 1 * 1/3 + 0 * 1/3 = 1/2, and the posterior is P(C_2 | H_3) = (1 * 1/3) / (1/2) = 2/3. Switching doubles the contestant's probability of winning the car, from 1/3 to 2/3.[19]
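The result can also be checked empirically. The following simulation is a sketch under the same assumptions as the analysis above: the host always opens a goat door and picks uniformly when two goat doors are available. Door labels 0 to 2, the function names, and the fixed seed are illustrative choices.

```python
import random


def play(switch, rng):
    """Simulate one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor
    # the car, choosing uniformly when two such doors exist.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the single remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

With this many trials the two frequencies should land close to 1/3 and 2/3 respectively, though the exact digits depend on the seed.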
Bayes' theorem underlies the simplest and most influential family of spam filtering algorithms. A naive Bayes spam filter treats an email as a bag of words and computes the posterior probability that the email is spam given the words it contains. For a single word w and hypotheses S (spam) and H (ham):
P(S | w) = P(w | S) * P(S) / ( P(w | S) * P(S) + P(w | H) * P(H) )
The filter estimates P(w | S) and P(w | H) from word frequencies in a labeled training corpus. To handle a whole message, the filter assumes words occur independently given the class, the "naive" assumption that gives naive Bayes its name. The assumption is clearly false in real text, but the resulting classifiers are surprisingly accurate, fast to train, and easy to update online. Paul Graham's 2002 essay "A Plan for Spam" popularized the approach and drove a step change in commercial filtering.[20][21]
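A toy version of such a filter fits in a few dozen lines. The sketch below is not Graham's algorithm: it is a plain multinomial naive Bayes classifier with add-one smoothing and log-space arithmetic, both of which are assumptions added here, and the four-message corpus is made up.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (label, text) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_probability(text, model):
    """Posterior P(S | words) under the naive independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption of this sketch) keeps an
            # unseen (word, class) pair from zeroing the whole product.
            p = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(p)
        log_score[label] = score
    # Normalize over the two classes without leaving log space early.
    return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


# Tiny made-up corpus, for illustration only.
corpus = [
    ("spam", "win cash prize now"),
    ("spam", "cheap prize offer now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train(corpus)
p_spam = spam_probability("cash prize offer", model)
p_ham_msg = spam_probability("monday meeting", model)
print(round(p_spam, 3), round(p_ham_msg, 3))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long messages, which is the usual reason practical implementations work in log space.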
The arithmetic of Bayes' theorem is elementary, but applying it correctly under stress is famously difficult. Several reasoning errors have acquired their own names.
The base rate fallacy consists of ignoring the prior probability of a hypothesis and treating the likelihood as if it were the posterior. People who hear that a test is 99 percent accurate often conclude that a positive result implies a 99 percent chance of disease, neglecting the small base rate. Daniel Kahneman and Amos Tversky documented many variants of this error in their work on heuristics and biases.[16][22]
The prosecutor's fallacy confuses P(evidence | innocent) with P(innocent | evidence). A prosecutor may argue that because the probability of a particular DNA match in an innocent person is one in a million, the probability that the defendant is innocent given the match is one in a million. By Bayes' theorem the two quantities differ by the factor P(innocent) / P(evidence), so they coincide only when that ratio equals one, which is rarely the case. The prosecutor's fallacy was named in a 1987 paper by William Thompson and Edward Schumann and has been blamed for several miscarriages of justice.[23]
The defendant's fallacy is the mirror image: ignoring evidence on the grounds that many people would also match. A defense attorney might argue that since one million people share a particular DNA profile fragment, the evidence is worthless. This neglects how the suspect population is restricted by other circumstances.[24]
A prominent real-world case is the wrongful 1999 conviction of Sally Clark in the United Kingdom for the murder of her two infant sons. Expert witness Sir Roy Meadow testified that the probability of two cot deaths in one family was about 1 in 73 million, computed by squaring the rate of a single cot death. The figure assumed independence between sibling deaths, an assumption now known to be wrong, and conflated P(two deaths | innocent) with P(innocent | two deaths). The Royal Statistical Society publicly criticized the reasoning. Clark's conviction was overturned in 2003, after she had served more than three years in prison.[25][26]
Bayes' theorem is the engine of Bayesian inference, an approach in which all unknowns are treated as random variables and probability is interpreted as a degree of belief. The analyst begins with a prior p(θ) over the unknown parameter θ. After observing data x with likelihood p(x | θ), the prior is updated to a posterior p(θ | x) ∝ p(x | θ) * p(θ). The posterior is the complete answer to the inference problem and is used for parameter estimation, prediction, and decision making.[6][7]
When the prior and likelihood belong to suitable families, the posterior takes the same functional form as the prior, and the prior is called a conjugate prior. Classic examples include the beta-binomial, gamma-Poisson, and normal-normal pairs. Conjugate analyses give closed-form posteriors and were the dominant style of Bayesian inference before the spread of MCMC.[6]
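The beta-binomial pair shows how little computation a conjugate update needs: the posterior parameters are the prior parameters plus the observed counts. A minimal sketch with hypothetical data (a uniform Beta(1, 1) prior followed by 7 successes and 3 failures):

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters under a Beta(alpha, beta) prior
    after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: uniform prior, then 7 successes and 3 failures.
post_alpha, post_beta = beta_binomial_update(1, 1, successes=7, failures=3)
post_mean = beta_mean(post_alpha, post_beta)
print(post_alpha, post_beta, round(post_mean, 4))
```

Because the posterior is again a beta distribution, it can serve directly as the prior for the next batch of data.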
From the posterior one can summarize uncertainty with a credible interval, defined so that the parameter lies inside the interval with a specified posterior probability such as 95 percent. Credible intervals look similar to frequentist confidence intervals but have a fundamentally different interpretation. A 95 percent credible interval is a set of parameter values containing 95 percent of posterior probability. A 95 percent confidence interval is the output of a procedure that, in repeated use, would cover the true parameter 95 percent of the time. The two can coincide numerically in simple problems but in general answer different questions.[6][27]
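Given draws from a posterior, an equal-tailed credible interval is just a pair of sample quantiles. The sketch below assumes, for illustration, a Beta(8, 4) posterior sampled with the standard library; the function name, sample size, and seed are arbitrary choices, and other interval definitions (such as highest posterior density) would give slightly different endpoints.

```python
import random


def equal_tailed_interval(samples, mass=0.95):
    """Equal-tailed credible interval estimated from posterior draws."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo_index = int(tail * len(ordered))
    hi_index = int((1.0 - tail) * len(ordered)) - 1
    return ordered[lo_index], ordered[hi_index]


rng = random.Random(0)
# Assumed posterior for illustration: Beta(8, 4).
draws = [rng.betavariate(8, 4) for _ in range(20_000)]
lower, upper = equal_tailed_interval(draws)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```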
When the posterior cannot be computed in closed form, modern Bayesian inference relies on numerical methods. Markov chain Monte Carlo algorithms, including Metropolis-Hastings and Gibbs sampling, produce posterior samples by constructing a Markov chain whose stationary distribution is the posterior. Variational inference instead approximates the posterior by the closest member of a tractable family under the Kullback-Leibler divergence.[15][28]
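A random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis-Hastings, illustrates the idea in a few lines. The sketch below targets the unnormalized posterior of a coin bias under a uniform prior with hypothetical data of 7 heads and 3 tails; the step size, chain length, burn-in, and seed are illustrative choices rather than tuned values.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, tails=3):
    """log of f(x | theta) * f(theta) for a coin with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside the unit interval
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with a Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_logp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)  # symmetric proposal
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


chain = metropolis(log_unnormalized_posterior, start=0.5, n_samples=50_000)
kept = chain[5_000:]  # discard an initial burn-in segment
estimate = sum(kept) / len(kept)
print(round(estimate, 3))
```

Only ratios of the target density enter the acceptance test, so the intractable normalizing constant never has to be computed. For this example the exact posterior is Beta(8, 4) with mean 2/3, which gives a check on the sampler's output.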
The Bayesian approach contrasts with the frequentist school, which interprets probability as the long-run relative frequency of an event in a hypothetical infinite sequence of trials and treats parameters as fixed unknowns rather than random variables. The frequentist toolkit includes maximum likelihood estimation, hypothesis testing with p-values, and confidence intervals derived from the sampling distribution of an estimator.[12][13]
The philosophical divide centers on the meaning of probability. Bayesians treat probability as a numerical measure of degree of belief, applicable to any proposition for which an agent has uncertainty, including one-off events such as the outcome of a particular election. Frequentists restrict probability to events that can in principle be embedded in a sequence of repeatable trials and reject the notion of probability for parameters that are simply unknown rather than random. The Bayesian framework requires priors, which can be subjective. The frequentist framework avoids priors but at the cost of more circuitous interpretations of inferential statements.[12][14]
The debate was sometimes acrimonious in the twentieth century, with Fisher dismissing Bayesian methods as unscientific. Since the 2000s the dispute has cooled. Most working statisticians and machine learning researchers use whichever framework better fits the problem, and methods including hierarchical models, empirical Bayes, and regularized regression blur the boundary. The two approaches often agree numerically when the prior is weak and the data are strong, and tend to disagree primarily in small-sample or highly structured problems.[12][29]
Bayes' theorem and its consequences pervade machine learning. The table below lists important methods that derive directly from Bayesian reasoning.
| Method | Idea | Typical use |
|---|---|---|
| Naive Bayes classifiers | Apply Bayes' theorem assuming features are conditionally independent given the class | Text classification, spam filtering |
| Bayesian networks | Directed acyclic graphs encoding conditional independence | Diagnostic systems, causal modeling |
| Markov chain Monte Carlo | Markov chains whose stationary distribution is the posterior | Posterior sampling, hierarchical models |
| Metropolis-Hastings | Propose moves and accept based on a ratio of posterior densities | General-purpose MCMC sampler |
| Gibbs sampling | Sample iteratively from full conditional distributions | Hierarchical and graphical models |
| Variational inference | Approximate the posterior by a tractable family minimizing KL divergence | Large-scale models, deep generative models |
| Bayesian neural networks | Place a prior on network weights and infer a posterior | Uncertainty quantification |
| Bayesian optimization | Use a probabilistic surrogate to sequentially select query points | Hyperparameter tuning, black-box optimization |
| Gaussian processes | Place a prior over functions parameterized by a kernel | Regression with uncertainty, surrogate modeling |
| Belief propagation | Pass local messages on a graph to compute marginals | Inference in graphical models |
| Probabilistic programming | Specify generative models in code and run general inference | Hierarchical modeling, research prototyping |
Probabilistic programming systems including Stan, PyMC, NumPyro, and Edward provide high-level interfaces for specifying Bayesian models and dispatching them to MCMC, Hamiltonian Monte Carlo, or variational backends. Naive Bayes classifiers were among the first machine learning methods deployed at scale and remain competitive baselines today. Text classification models from naive Bayes through logistic regression and transformers share the Bayesian decision-theoretic framing of choosing the class with maximum posterior probability.[20][30]
Beyond machine learning, Bayes' theorem appears wherever rational agents must update beliefs from evidence. In medicine, evidence-based diagnostic reasoning and screening test interpretation are explicitly Bayesian. In law, Bayesian analyses of forensic evidence have been accepted by some courts, though their use remains controversial. In intelligence analysis, structured techniques such as analysis of competing hypotheses are informal applications of the theorem.[12][16]
Search and rescue operations have used Bayesian methods since the 1960s. The 1968 search for the lost American submarine USS Scorpion used a Bayesian map of search probabilities developed by John Craven of the U.S. Navy. Searchers solicited expert guesses about the submarine's location, combined them into a prior, and updated the map after each unsuccessful search. The submarine was found close to the most probable location. A similar Bayesian search was used in 2011 by Metron Inc., working for the French air accident investigation authority, to locate the wreckage of Air France flight 447 in the South Atlantic, nearly two years after the June 2009 crash.[12][31] Bayes' theorem also underlies pedigree analysis in genetics, I. J. Good's cryptanalysis at Bletchley Park, portfolio analysis in finance, and rational-agent reasoning in AI safety.[12][14]
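The map update after an unsuccessful search is itself an application of Bayes' theorem: the evidence "searched this cell and found nothing" has likelihood 1 minus the detection probability if the object is in that cell and likelihood 1 otherwise. The sketch below shows this textbook update on a hypothetical three-cell map; it is not a reconstruction of the procedure used in either search, and the cell names and detection probability are made up.

```python
def update_after_failed_search(prior, searched_cell, detection_prob):
    """Posterior cell probabilities after searching one cell and
    finding nothing."""
    # Likelihood of "not found" under each hypothesis about the location.
    likelihood = {
        cell: (1.0 - detection_prob) if cell == searched_cell else 1.0
        for cell in prior
    }
    unnormalized = {cell: likelihood[cell] * p for cell, p in prior.items()}
    total = sum(unnormalized.values())
    return {cell: u / total for cell, u in unnormalized.items()}


# Hypothetical map: cell A is the most probable location a priori.
prior = {"A": 0.6, "B": 0.3, "C": 0.1}
posterior = update_after_failed_search(prior, searched_cell="A", detection_prob=0.8)
print(posterior)
```

A failed search lowers the probability of the searched cell and raises every other cell proportionally, so in this example cell B overtakes cell A as the most probable location.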
Several variants and extensions of Bayes' theorem appear repeatedly in applied work. Bayes factors are ratios of marginal likelihoods under competing models, central to Bayesian model comparison because they capture the evidence data provide for one model over another, integrated over the prior of each model's parameters. Harold Jeffreys proposed an interpretive scale running from "barely worth mentioning" to "decisive" evidence that remains in widespread use.[9]
Empirical Bayes methods estimate the prior from the data, typically by combining information across many parallel inference problems. Charles Stein's 1956 result that the MLE of a multivariate normal mean is inadmissible in three or more dimensions and is dominated by a shrinkage estimator gave a frequentist motivation for what was effectively an empirical Bayes calculation. Empirical Bayes is standard in genomics, where thousands of parallel hypothesis tests share information through a learned prior.[32]
Hierarchical Bayesian models place priors on the parameters of priors, producing multi-level structures that capture variation between groups while sharing strength across them. They are common in education research, ecology, sports analytics, and pharmacokinetics, and are a natural application of modern probabilistic programming systems.[7]