Bayes' theorem

Bayes' theorem (also called Bayes' rule or Bayes' law) is a fundamental theorem of probability theory that describes how to update the probability of a hypothesis given new evidence. It expresses the posterior probability of a proposition as proportional to the product of its prior probability and the likelihood of the observed data. In its simplest form the theorem states that P(A | B) = P(B | A) * P(A) / P(B), where P(B) > 0.^[1]^[2]

The theorem is named after the Reverend Thomas Bayes (1701-1761), an English Presbyterian minister and amateur mathematician. His work was made public posthumously by his friend Richard Price, who edited Bayes's manuscript and submitted it to the Royal Society of London, where it was read in December 1763.^[3]^[4] The theorem was independently rediscovered and substantially generalized by the French mathematician Pierre-Simon Laplace in 1774 and developed further in his 1812 treatise Théorie analytique des probabilités.^[5]

Despite its simple algebraic form, Bayes' theorem provides the foundation for an entire school of statistical reasoning known as Bayesian statistics or Bayesian inference. It underlies medical diagnosis, spam classification, search and rescue operations, scientific model comparison, and large parts of contemporary machine learning. Algorithms from naive Bayes classifiers and Bayesian networks to Markov chain Monte Carlo samplers and variational inference all flow from the rule first written down in Bayes's 1763 essay.^[6]^[7]

Statement of the theorem

Let A and B be two events with P(B) > 0. Bayes' theorem states P(A | B) = P(B | A) * P(A) / P(B). The equation expresses the probability of A after observing B (the posterior) in terms of the prior probability of A, the likelihood of observing B under A, and the total probability of B. Each piece plays a distinct role in Bayesian reasoning.

Symbol	Name	Interpretation
`P(A | B)`	Posterior probability	Probability of `A` after observing the evidence `B`
`P(B | A)`	Likelihood	Probability of observing `B` if `A` is true
`P(A)`	Prior probability	Probability of `A` before observing the evidence
`P(B)`	Marginal likelihood	Total probability of `B` over all hypotheses; the normalizing constant

In practice P(B) is often hard to compute directly, and the theorem is used in its proportional form P(A | B) ∝ P(B | A) * P(A). The constant is recovered by requiring posterior probabilities to sum (or integrate) to one over all hypotheses.^[6]

Derivation

Bayes' theorem follows directly from the definition of conditional probability. For events A and B with P(A) > 0 and P(B) > 0:

P(A | B) = P(A ∩ B) / P(B) and P(B | A) = P(A ∩ B) / P(A)

Multiplying the second equation by P(A) gives P(A ∩ B) = P(B | A) * P(A). Substituting into the first equation yields P(A | B) = P(B | A) * P(A) / P(B). The derivation is symmetric in A and B: Bayes' theorem is really a statement about how to invert a conditional probability. Given a forward model that tells us how likely the evidence is under each hypothesis, the theorem flips the conditioning to reason from evidence back to hypothesis.^[1]^[2]

Extended forms

Bayes' theorem can be extended in several directions useful in applications.

Law of total probability form

When the hypothesis space is partitioned into mutually exclusive and exhaustive events {A_1, A_2, ..., A_n}, the marginal likelihood expands by the law of total probability:

P(B) = sum_i P(B | A_i) * P(A_i)

Substituting gives the form used in nearly every elementary probability course:

P(A_k | B) = P(B | A_k) * P(A_k) / ( sum_i P(B | A_i) * P(A_i) )

This is the version Laplace wrote down in his 1774 memoir.^[5]^[8]
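The partition form translates almost line for line into code. The following is a minimal sketch, not a library routine: the function name `posterior` and its list-based interface are illustrative choices, and the example numbers at the end are hypothetical.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every hypothesis A_i in a partition.

    priors[i] is P(A_i); likelihoods[i] is P(B | A_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    # Numerators P(B | A_i) * P(A_i), one per hypothesis.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator P(B), by the law of total probability.
    evidence = sum(joint)
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return [j / evidence for j in joint]


# Hypothetical three-hypothesis example.
result = posterior([0.2, 0.3, 0.5], [0.9, 0.5, 0.1])
```

Because every numerator is divided by the same sum, the returned values always add up to one, which is the normalization the proportional form relies on.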

Continuous version

For continuous random variables with parameter θ and observed data x, the discrete sum is replaced by an integral and probabilities by densities:

f(θ | x) = f(x | θ) * f(θ) / ( ∫ f(x | θ) * f(θ) dθ )

This form is the workhorse of modern Bayesian statistics. Here f(x | θ) is the likelihood, f(θ) is the prior density, and the denominator is the marginal likelihood. Computing the integral is often intractable for high-dimensional θ, motivating the use of MCMC and variational inference methods.^[6]^[7]
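For a one-dimensional parameter the integral in the denominator can be approximated on a grid. The sketch below assumes, purely for illustration, a binomial likelihood (k successes in n trials) and a uniform prior on [0, 1]; the function name `grid_posterior` and the data values are hypothetical.

```python
from math import comb


def grid_posterior(k, n, num_points=1000):
    """Approximate f(theta | x) on a grid for k successes in n trials.

    Assumes a uniform prior f(theta) = 1 on [0, 1] (an illustrative choice).
    """
    width = 1.0 / num_points
    # Midpoints of equal-width cells covering [0, 1].
    thetas = [(i + 0.5) * width for i in range(num_points)]
    prior = [1.0] * num_points
    likelihood = [comb(n, k) * t**k * (1 - t) ** (n - k) for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    # Midpoint-rule approximation of the marginal likelihood integral.
    evidence = sum(unnormalized) * width
    density = [u / evidence for u in unnormalized]
    return thetas, density


thetas, density = grid_posterior(k=7, n=10)
width = 1.0 / len(thetas)
posterior_mean = sum(t * d for t, d in zip(thetas, density)) * width
```

Under these assumptions the exact posterior is a Beta(8, 4) distribution with mean 2/3, so `posterior_mean` should land very close to that value. Grid methods like this stop being practical once θ has more than a few dimensions, which is the intractability the paragraph above refers to.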

Odds form and Bayes factors

Dividing the posterior probability of a hypothesis by the posterior of its negation eliminates the marginal likelihood:

P(A | B) / P(¬A | B) = ( P(B | A) / P(B | ¬A) ) * ( P(A) / P(¬A) )

In words, posterior odds equal the likelihood ratio multiplied by prior odds. The ratio P(B | A) / P(B | ¬A) is the Bayes factor for A against ¬A, and it summarizes how strongly the evidence favors A. Bayes factors are widely used in model comparison because they isolate what the data contribute from the prior odds assigned to the competing models, so readers with different prior odds can apply the same Bayes factor to reach their own posterior odds.^[9]
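The odds form is convenient in code because the marginal likelihood never has to be computed. A small sketch follows; the function names and the numbers (a prior of 0.01 and likelihoods of 0.99 and 0.01) are chosen only for illustration.

```python
def posterior_odds(prior_prob, likelihood_given_a, likelihood_given_not_a):
    """Posterior odds of A against not-A: Bayes factor times prior odds."""
    prior_odds = prior_prob / (1 - prior_prob)
    bayes_factor = likelihood_given_a / likelihood_given_not_a
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Convert odds o = p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


# Illustrative values: P(A) = 0.01, P(B | A) = 0.99, P(B | not A) = 0.01.
odds = posterior_odds(0.01, 0.99, 0.01)
probability = odds_to_probability(odds)
```

With these values the Bayes factor of 99 exactly offsets prior odds of 1 to 99, giving even posterior odds and a posterior probability of one half.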

Historical background

Thomas Bayes and the 1763 essay

Thomas Bayes was born in London, probably in 1701, though the date is uncertain and some sources give 1702. He was the son of Joshua Bayes, one of the first six Nonconformist ministers publicly ordained in England after the Act of Toleration. Thomas Bayes became a Presbyterian minister and from around 1733 served the congregation at Tunbridge Wells in Kent. He was elected a Fellow of the Royal Society in 1742 for mathematical work that included a defence of Newton's calculus against Bishop Berkeley.^[3]^[10]

Bayes did not publish his most famous result during his lifetime. After his death on 7 April 1761, his friend Richard Price (1723-1791) received Bayes's mathematical papers, among them an essay titled "An Essay towards solving a Problem in the Doctrine of Chances." Price edited the manuscript, added an introduction and appendix, and submitted it to the Royal Society. It was read on 23 December 1763 and published in 1764 in Philosophical Transactions of the Royal Society of London, volume 53, pages 370-418.^[1]^[3] Bayes framed the problem geometrically, using a thought experiment involving balls thrown onto a square table, and solved a special case with a uniform prior on the unit interval. The general formulation now called Bayes' theorem was extracted from his work by later mathematicians, especially Laplace.^[4]^[11]

Laplace's contributions

The French mathematician Pierre-Simon Laplace (1749-1827) independently arrived at the principle of inverse probability in his 1774 memoir "Mémoire sur la probabilité des causes par les événemens," without knowing of Bayes's earlier essay. He stated what we would now call Bayes' theorem in essentially its modern general form, applied it to problems including the estimation of the masses of Saturn and Jupiter from astronomical observations, and used it as the core engine of his probability theory.^[5]^[8] Laplace's monumental Théorie analytique des probabilités, first published in 1812, developed the Bayesian approach so thoroughly that for most of the nineteenth century the method was known as the "probability of causes" or "inverse probability" and was attributed primarily to Laplace. The modern name "Bayes' theorem" became standard only in the early twentieth century.^[5]^[12]

The twentieth-century debate and revival

In the early twentieth century, Bayesian methods fell out of favor. Ronald A. Fisher, Jerzy Neyman, and Egon Pearson developed an alternative framework based on long-run frequencies of repeated experiments, including significance tests, confidence intervals, and hypothesis testing. Fisher was sharply critical of subjective priors. The frequentist school dominated mainstream statistics from the 1920s through the 1950s.^[12]^[13]

A Bayesian revival began in the 1950s, driven by mathematicians and philosophers who argued that probability could legitimately represent degrees of belief. Key figures included Harold Jeffreys, whose 1939 Theory of Probability gave a comprehensive Bayesian treatment with objective priors; Bruno de Finetti, who developed a subjective theory and proved his representation theorem for exchangeable sequences; Leonard J. Savage, whose 1954 The Foundations of Statistics axiomatized subjective probability; and I. J. Good, who applied Bayesian methods to cryptanalysis at Bletchley Park.^[12]^[14]

A practical revival came in the late 1980s and early 1990s with cheap computing and the rediscovery of Markov chain Monte Carlo methods. The Metropolis algorithm dated from 1953 and Hastings's generalization from 1970, but it was the 1990 paper by Alan Gelfand and Adrian Smith on Gibbs sampling that brought MCMC into mainstream statistics. Integrals that had previously made Bayesian analysis intractable could now be approximated numerically, and Bayesian methods spread rapidly through statistics, machine learning, and the empirical sciences.^[12]^[15]

Worked examples

Medical test example

The classical illustration is a diagnostic test. Suppose a disease has 1 percent prevalence, so the prior P(D) = 0.01. A blood test has sensitivity (true positive rate) P(+ | D) = 0.99 and specificity (true negative rate) P(- | ¬D) = 0.99, implying a false positive rate P(+ | ¬D) = 0.01. A patient tests positive. What is the probability they actually have the disease?

Intuition often suggests an answer near 99 percent, but Bayes' theorem gives a very different result. Apply the theorem with the law of total probability:

P(D | +) = P(+ | D) * P(D) / ( P(+ | D) * P(D) + P(+ | ¬D) * P(¬D) )

Substituting the numerical values:

P(D | +) = (0.99 * 0.01) / ( 0.99 * 0.01 + 0.01 * 0.99 ) = 0.0099 / 0.0198 = 0.5

A positive test result raises the probability of disease only to 50 percent. With 1 percent prevalence the population contains far more healthy people than sick people, so even a 1 percent false positive rate produces as many false positives as the test produces true positives. This is the textbook illustration of the base rate fallacy and is why a positive screening result usually triggers a confirmatory test rather than an immediate diagnosis.^[16]^[17] If the same test is applied in a high-risk population where prevalence is 50 percent, the calculation gives P(D | +) = 0.99. The diagnostic value of a test depends on the prior probability that the patient has the disease, not only on its sensitivity and specificity.
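The dependence on prevalence is easy to explore numerically. The sketch below wraps the calculation in a hypothetical helper, `positive_predictive_value`, and evaluates it at the two prevalences discussed above plus two extra values (0.001 and 0.1) added only for illustration.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D | +) from prevalence P(D), sensitivity P(+ | D), specificity P(- | not D)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


for prevalence in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"prevalence={prevalence:<6} P(D | +)={ppv:.3f}")
```

The same test, with unchanged sensitivity and specificity, yields very different posteriors as the prior changes.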

Cookie jar example

A classic introductory exercise: Jar 1 contains 30 vanilla and 10 chocolate cookies; Jar 2 contains 20 vanilla and 20 chocolate cookies. A jar is chosen uniformly at random and a cookie is drawn. The cookie is vanilla. What is the probability it came from Jar 1?

With priors P(J1) = P(J2) = 0.5 and likelihoods P(V | J1) = 0.75, P(V | J2) = 0.5, Bayes' theorem gives P(J1 | V) = (0.75 * 0.5) / (0.75 * 0.5 + 0.5 * 0.5) = 0.375 / 0.625 = 0.6. The vanilla cookie raises the probability the jar was Jar 1 from 0.5 to 0.6.^[2]^[18]
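The cookie calculation can also be done with exact rational arithmetic, working directly from the cookie counts. This is a sketch; the dictionary layout and the name `jar_posterior` are illustrative.

```python
from fractions import Fraction

jars = {
    "Jar 1": {"vanilla": 30, "chocolate": 10},
    "Jar 2": {"vanilla": 20, "chocolate": 20},
}


def jar_posterior(flavor):
    """Posterior over jars after drawing one cookie of the given flavor."""
    prior = Fraction(1, len(jars))  # each jar equally likely a priori
    joint = {
        name: prior * Fraction(counts[flavor], sum(counts.values()))
        for name, counts in jars.items()
    }
    total = sum(joint.values())  # P(flavor), the marginal likelihood
    return {name: value / total for name, value in joint.items()}


result = jar_posterior("vanilla")
```

Here `result["Jar 1"]` is the fraction 3/5, matching the 0.6 obtained above.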

Monty Hall problem

In the Monty Hall problem, a contestant on a game show is shown three doors. Behind one door is a car and behind the other two are goats. The contestant picks a door, after which the host, who knows where the car is, opens one of the remaining doors to reveal a goat and offers the contestant the chance to switch. Should they switch?

Let C_i denote the event that the car is behind door i, and let H_3 denote the event that the host opens door 3. Suppose the contestant picks door 1 and the host opens door 3. By Bayes' theorem P(C_2 | H_3) = P(H_3 | C_2) * P(C_2) / P(H_3). If the car is behind door 2 the host must open door 3, so P(H_3 | C_2) = 1. If the car is behind door 1 the host chooses door 2 or 3 uniformly, so P(H_3 | C_1) = 0.5. If the car is behind door 3 the host cannot open it, so P(H_3 | C_3) = 0. The priors are all 1/3. By the law of total probability the marginal is P(H_3) = 0.5 * 1/3 + 1 * 1/3 + 0 * 1/3 = 1/2, and the posterior is P(C_2 | H_3) = (1 * 1/3) / (1/2) = 2/3. Switching doubles the contestant's probability of winning the car, from 1/3 to 2/3.^[19]
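The result can be checked by simulation. The sketch below assumes the standard rules stated above (the host always opens a goat door other than the contestant's pick, choosing uniformly when two are available); the function names, trial count, and seed are arbitrary choices.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that hides a goat and is not the contestant's pick.
    host = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != host)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Fraction of wins over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
```

With enough trials, `stay_rate` should approach 1/3 and `switch_rate` should approach 2/3, in line with the Bayesian calculation.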

Spam filter intuition

Bayes' theorem underlies the simplest and most influential family of spam filtering algorithms. A naive Bayes spam filter treats an email as a bag of words and computes the posterior probability that the email is spam given the words it contains. For a single word w and hypotheses S (spam) and H (ham):

P(S | w) = P(w | S) * P(S) / ( P(w | S) * P(S) + P(w | H) * P(H) )

The filter estimates P(w | S) and P(w | H) from word frequencies in a labeled training corpus. To handle a whole message, the filter assumes words occur independently given the class, the "naive" assumption that gives naive Bayes its name. The assumption is clearly false in real text, but the resulting classifiers are surprisingly accurate, fast to train, and easy to update online. Paul Graham's 2002 essay "A Plan for Spam" popularized the approach and drove a step change in commercial filtering.^[20]^[21]
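A toy version of such a filter fits in a few dozen lines. The sketch below is not Graham's method or any production filter: the four-message corpus is invented, the whitespace tokenizer is deliberately crude, and add-one smoothing is an added assumption to avoid zero likelihoods for unseen words.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (label, text) pairs, label being 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_probability(text, model):
    """P(spam | words) under the naive conditional-independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            # Add-one smoothing (an assumption of this sketch).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_score[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Invented training corpus, for illustration only.
corpus = [
    ("spam", "win cash now"),
    ("spam", "cheap cash offer"),
    ("ham", "meeting agenda attached"),
    ("ham", "lunch meeting tomorrow"),
]
model = train(corpus)
```

Summing log probabilities rather than multiplying raw probabilities keeps long messages from underflowing to zero, a common design choice in naive Bayes implementations.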

Common fallacies and misinterpretations

The arithmetic of Bayes' theorem is elementary, but applying it correctly is notoriously difficult, even for trained professionals. Several reasoning errors have acquired their own names.

The base rate fallacy consists of ignoring the prior probability of a hypothesis and treating the likelihood as if it were the posterior. People who hear that a test is 99 percent accurate often conclude that a positive result implies a 99 percent chance of disease, neglecting the small base rate. Daniel Kahneman and Amos Tversky documented many variants of this error in their work on heuristics and biases.^[16]^[22]

The prosecutor's fallacy confuses P(evidence | innocent) with P(innocent | evidence). A prosecutor may argue that because the probability of a particular DNA match in an innocent person is one in a million, the probability that the defendant is innocent given the match is one in a million. By Bayes' theorem the two quantities differ by the factor P(innocent) / P(evidence), so they coincide only when that factor happens to equal one, which is rarely the case. The prosecutor's fallacy was named in a 1987 paper by William Thompson and Edward Schumann and has been blamed for several miscarriages of justice.^[23]
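A short calculation shows how far apart the two probabilities can be. The sketch below makes strong simplifying assumptions that are not from any real case: exactly one true source among a hypothetical pool of ten million equally likely people, a certain match for the true source, and a random match probability of one in a million for everyone else.

```python
def prob_source_given_match(random_match_prob, population_size):
    """P(defendant is the source | match) under the simplified model above."""
    prior_source = 1 / population_size
    # P(match) = P(match | source) P(source) + P(match | not source) P(not source)
    p_match = 1.0 * prior_source + random_match_prob * (1 - prior_source)
    return prior_source / p_match


p_source = prob_source_given_match(random_match_prob=1e-6, population_size=10_000_000)
p_innocent = 1 - p_source
```

Under these assumptions `p_innocent` is roughly 0.91, nowhere near the one-in-a-million figure, because about ten innocent people in the pool would be expected to match by chance.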

The defendant's fallacy is the mirror image: ignoring evidence on the grounds that many people would also match. A defense attorney might argue that since one million people share a particular DNA profile fragment, the evidence is worthless. This neglects how the suspect population is restricted by other circumstances.^[24]

A prominent real-world case is the wrongful 1999 conviction of Sally Clark in the United Kingdom for the murder of her two infant sons. Expert witness Sir Roy Meadow testified that the probability of two cot deaths in one family was about 1 in 73 million, computed by squaring the rate of a single cot death. The figure assumed independence between sibling deaths, an assumption now known to be wrong, and conflated P(two deaths | innocent) with P(innocent | two deaths). The Royal Statistical Society publicly criticized the reasoning. Clark's conviction was overturned in 2003, after she had served more than three years in prison.^[25]^[26]

Bayesian inference

Bayes' theorem is the engine of Bayesian inference, an approach in which all unknowns are treated as random variables and probability is interpreted as a degree of belief. The analyst begins with a prior p(θ) over the unknown parameter θ. After observing data x with likelihood p(x | θ), the prior is updated to a posterior p(θ | x) ∝ p(x | θ) * p(θ). The posterior is the complete answer to the inference problem and is used for parameter estimation, prediction, and decision making.^[6]^[7]

When the prior and likelihood belong to suitable families, the posterior takes the same functional form as the prior, and the prior is called a conjugate prior. Classic examples include the beta-binomial, gamma-Poisson, and normal-normal pairs. Conjugate analyses give closed-form posteriors and were the dominant style of Bayesian inference before the spread of MCMC.^[6]
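For the beta-binomial pair the update reduces to adding counts to the prior's parameters. A minimal sketch, with illustrative function names and made-up data batches:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from a uniform Beta(1, 1) prior and fold in two hypothetical batches.
alpha, beta = 1, 1
for successes, failures in [(7, 3), (12, 8)]:
    alpha, beta = beta_binomial_update(alpha, beta, successes, failures)

posterior_mean = beta_mean(alpha, beta)
```

The posterior after the first batch serves as the prior for the second, and the final Beta(20, 12) is the same as would be obtained by processing all 30 observations at once.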

From the posterior one can summarize uncertainty with a credible interval, defined so that the parameter lies inside the interval with a specified posterior probability such as 95 percent. Credible intervals look similar to frequentist confidence intervals but have a fundamentally different interpretation. A 95 percent credible interval is a set of parameter values containing 95 percent of posterior probability. A 95 percent confidence interval is the output of a procedure that, in repeated use, would produce intervals containing the true parameter 95 percent of the time. The two can coincide numerically in simple problems but in general answer different questions.^[6]^[27]

When the posterior cannot be computed in closed form, modern Bayesian inference relies on numerical methods. Markov chain Monte Carlo algorithms, including Metropolis-Hastings and Gibbs sampling, produce posterior samples by constructing a Markov chain whose stationary distribution is the posterior. Variational inference instead approximates the posterior by the closest member of a tractable family under the Kullback-Leibler divergence.^[15]^[28]
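The core of a Metropolis sampler is short. The sketch below uses a symmetric Gaussian random-walk proposal, so the Hastings correction cancels, and targets an illustrative posterior: a coin's bias θ after 7 heads in 10 flips with a uniform prior. The data, step size, burn-in length, and seed are all assumptions made for the example.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, flips=10):
    """log(likelihood * uniform prior); the normalizing constant is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(num_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis sampler for the posterior defined above."""
    rng = random.Random(seed)
    theta = start
    log_p = log_unnormalized_posterior(theta)
    samples = []
    for _ in range(num_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        log_p_new = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, p_new / p_old).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples


samples = metropolis(50_000)
kept = samples[5_000:]  # discard an initial burn-in segment
estimate = sum(kept) / len(kept)
```

Only ratios of posterior densities enter the acceptance step, which is why the intractable marginal likelihood never has to be evaluated. For this target the exact posterior mean is 2/3, and `estimate` should be close to it.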

Bayesian versus frequentist statistics

The Bayesian approach contrasts with the frequentist school, which interprets probability as the long-run relative frequency of an event in a hypothetical infinite sequence of trials and treats parameters as fixed unknowns rather than random variables. The frequentist toolkit includes maximum likelihood estimation, hypothesis testing with p-values, and confidence intervals derived from the sampling distribution of an estimator.^[12]^[13]

The philosophical divide centers on the meaning of probability. Bayesians treat probability as a numerical measure of degree of belief, applicable to any proposition for which an agent has uncertainty, including one-off events such as the outcome of a particular election. Frequentists restrict probability to events that can in principle be embedded in a sequence of repeatable trials and reject the notion of probability for parameters that are simply unknown rather than random. The Bayesian framework requires priors, which can be subjective. The frequentist framework avoids priors but at the cost of more circuitous interpretations of inferential statements.^[12]^[14]

The debate was sometimes acrimonious in the twentieth century, with Fisher dismissing Bayesian methods as unscientific. Since the 2000s the dispute has cooled. Most working statisticians and machine learning researchers use whichever framework better fits the problem, and methods including hierarchical models, empirical Bayes, and regularized regression blur the boundary. The two approaches often agree numerically when the prior is weak and the data are strong, and tend to disagree primarily in small-sample or highly structured problems.^[12]^[29]

Applications in machine learning

Bayes' theorem and its consequences pervade machine learning. The table below lists important methods that derive directly from Bayesian reasoning.

Method	Idea	Typical use
Naive Bayes classifiers	Apply Bayes' theorem assuming features are conditionally independent given the class	Text classification, spam filtering
Bayesian networks	Directed acyclic graphs encoding conditional independence	Diagnostic systems, causal modeling
Markov chain Monte Carlo	Markov chains whose stationary distribution is the posterior	Posterior sampling, hierarchical models
Metropolis-Hastings	Propose moves and accept based on a ratio of posterior densities	General-purpose MCMC sampler
Gibbs sampling	Sample iteratively from full conditional distributions	Hierarchical and graphical models
Variational inference	Approximate the posterior by a tractable family minimizing KL divergence	Large-scale models, deep generative models
Bayesian neural networks	Place a prior on network weights and infer a posterior	Uncertainty quantification
Bayesian optimization	Use a probabilistic surrogate to sequentially select query points	Hyperparameter tuning, black-box optimization
Gaussian processes	Place a prior over functions parameterized by a kernel	Regression with uncertainty, surrogate modeling
Belief propagation	Pass local messages on a graph to compute marginals	Inference in graphical models
Probabilistic programming	Specify generative models in code and run general inference	Hierarchical modeling, research prototyping

Probabilistic programming systems including Stan, PyMC, NumPyro, and Edward provide high-level interfaces for specifying Bayesian models and dispatching them to MCMC, Hamiltonian Monte Carlo, or variational backends. Naive Bayes classifiers were among the first machine learning methods deployed at scale and remain competitive baselines today. Text classification models from naive Bayes through logistic regression and transformers share the Bayesian decision-theoretic framing of choosing the class with maximum posterior probability.^[20]^[30]

Other applications

Beyond machine learning, Bayes' theorem appears wherever rational agents must update beliefs from evidence. In medicine, evidence-based diagnostic reasoning and screening test interpretation are explicitly Bayesian. In law, Bayesian analyses of forensic evidence have been accepted by some courts, though their use remains controversial. In intelligence analysis, structured techniques such as analysis of competing hypotheses are informal applications of the theorem.^[12]^[16]

Search and rescue operations have used Bayesian methods since the 1960s. The 1968 search for the lost American submarine USS Scorpion used a Bayesian map of search probabilities developed by John Craven of the U.S. Navy. Searchers solicited expert guesses about the submarine's location, combined them into a prior, and updated the map after each unsuccessful search. The submarine was found close to the most probable location. A similar Bayesian search was carried out in 2011 by Metron Inc., working for the French accident investigation authority (BEA), to locate the wreckage of Air France flight 447 in the South Atlantic, nearly two years after the crash.^[12]^[31] Bayes' theorem also underlies pedigree analysis in genetics, I. J. Good's cryptanalysis at Bletchley Park, portfolio analysis in finance, and rational-agent reasoning in AI safety.^[12]^[14]
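The map update at the heart of a Bayesian search, as described for the Scorpion case, can be sketched in a few lines. This is a simplified illustration rather than the procedure used in either search: the three cells, their prior probabilities, and the detection probability are all invented.

```python
def update_after_failed_search(prior, searched_cell, detection_prob):
    """Posterior over cells after searching one cell and finding nothing.

    prior: dict mapping cell name -> probability the object is in that cell.
    detection_prob: P(object found | object is in the searched cell).
    """
    # The likelihood of "not found" is (1 - detection_prob) for the searched
    # cell and 1 for every other cell.
    unnormalized = {
        cell: p * ((1 - detection_prob) if cell == searched_cell else 1.0)
        for cell, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {cell: u / total for cell, u in unnormalized.items()}


# Hypothetical prior map and a failed search of the most likely cell.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
posterior = update_after_failed_search(prior, "A", detection_prob=0.8)
```

After the failed search, probability shifts away from cell A toward the unsearched cells, so B becomes the most probable location and the natural place to look next.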

Notable formulations

Several variants and extensions of Bayes' theorem appear repeatedly in applied work. Bayes factors are ratios of marginal likelihoods under competing models, central to Bayesian model comparison because they capture the evidence data provide for one model over another, integrated over the prior of each model's parameters. Harold Jeffreys proposed an interpretive scale running from "barely worth mentioning" to "decisive" evidence that remains in widespread use.^[9]

Empirical Bayes methods estimate the prior from the data, typically by combining information across many parallel inference problems. Charles Stein's 1956 result that the MLE of a multivariate normal mean is inadmissible in three or more dimensions and is dominated by a shrinkage estimator gave a frequentist motivation for what was effectively an empirical Bayes calculation. Empirical Bayes is standard in genomics, where thousands of parallel hypothesis tests share information through a learned prior.^[32]

Hierarchical Bayesian models place priors on the parameters of priors, producing multi-level structures that capture variation between groups while sharing strength across them. They are common in education research, ecology, sports analytics, and pharmacokinetics, and are a natural application of modern probabilistic programming systems.^[7]

References

Bayes, T. (1763). "An Essay towards solving a Problem in the Doctrine of Chances." *Philosophical Transactions of the Royal Society of London*, 53, 370-418. https://royalsocietypublishing.org/doi/10.1098/rstl.1763.0053
Joyce, J. (2003, revised 2021). "Bayes' Theorem." *Stanford Encyclopedia of Philosophy*. https://plato.stanford.edu/entries/bayes-theorem/
Bellhouse, D. R. (2004). "The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth." *Statistical Science*, 19(1), 3-43. https://projecteuclid.org/journals/statistical-science/volume-19/issue-1/10.1214/088342304000000189.full
Stigler, S. M. (1982). "Thomas Bayes's Bayesian Inference." *Journal of the Royal Statistical Society. Series A*, 145(2), 250-258. https://www.jstor.org/stable/2981538
Laplace, P.-S. (1812). *Théorie analytique des probabilités*. Courcier, Paris. https://gallica.bnf.fr/ark:/12148/bpt6k88764q
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). *Bayesian Data Analysis*, 3rd ed. Chapman and Hall/CRC. http://www.stat.columbia.edu/~gelman/book/
Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
Laplace, P.-S. (1774). "Mémoire sur la probabilité des causes par les événemens." *Mémoires de l'Académie Royale des Sciences de Paris*, 6, 621-656. https://gallica.bnf.fr/ark:/12148/bpt6k77596b
Kass, R. E. and Raftery, A. E. (1995). "Bayes Factors." *Journal of the American Statistical Association*, 90(430), 773-795. https://www.tandfonline.com/doi/abs/10.1080/01621459.1995.10476572
Dale, A. I. (1999). *A History of Inverse Probability: From Thomas Bayes to Karl Pearson*, 2nd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-8652-8
Stigler, S. M. (1986). *The History of Statistics: The Measurement of Uncertainty before 1900*. Harvard University Press. https://www.hup.harvard.edu/books/9780674403413
McGrayne, S. B. (2011). *The Theory That Would Not Die*. Yale University Press. https://yalebooks.yale.edu/book/9780300188226/the-theory-that-would-not-die/
Fienberg, S. E. (2006). "When Did Bayesian Inference Become 'Bayesian'?" *Bayesian Analysis*, 1(1), 1-40. https://projecteuclid.org/journals/bayesian-analysis/volume-1/issue-1/10.1214/06-BA101.full
Jeffreys, H. (1939, 3rd ed. 1961). *Theory of Probability*. Oxford University Press. https://global.oup.com/academic/product/theory-of-probability-9780198503682
Gelfand, A. E. and Smith, A. F. M. (1990). "Sampling-Based Approaches to Calculating Marginal Densities." *Journal of the American Statistical Association*, 85(410), 398-409. https://www.tandfonline.com/doi/abs/10.1080/01621459.1990.10476213
Tversky, A. and Kahneman, D. (1974). "Judgment under Uncertainty: Heuristics and Biases." *Science*, 185(4157), 1124-1131. https://www.science.org/doi/10.1126/science.185.4157.1124
Eddy, D. M. (1982). "Probabilistic Reasoning in Clinical Medicine." In *Judgment Under Uncertainty: Heuristics and Biases*, Cambridge University Press, 249-267. https://www.cambridge.org/core/books/judgment-under-uncertainty/
Downey, A. B. (2013). *Think Bayes: Bayesian Statistics in Python*. O'Reilly. https://greenteapress.com/wp/think-bayes/
Selvin, S. (1975). "A Problem in Probability" (letter). *The American Statistician*, 29(1), 67. https://www.tandfonline.com/doi/abs/10.1080/00031305.1975.10479121
Graham, P. (2002). "A Plan for Spam." http://www.paulgraham.com/spam.html
Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). "A Bayesian Approach to Filtering Junk E-mail." *AAAI Workshop on Learning for Text Categorization*. https://www.aaai.org/Papers/Workshops/1998/WS-98-05/WS98-05-009.pdf
Bar-Hillel, M. (1980). "The Base-Rate Fallacy in Probability Judgments." *Acta Psychologica*, 44(3), 211-233. https://www.sciencedirect.com/science/article/abs/pii/0001691880900462
Thompson, W. C. and Schumann, E. L. (1987). "Interpretation of Statistical Evidence in Criminal Trials." *Law and Human Behavior*, 11(3), 167-187. https://link.springer.com/article/10.1007/BF01044641
Aitken, C. G. G. and Taroni, F. (2004). *Statistics and the Evaluation of Evidence for Forensic Scientists*, 2nd ed. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/0470011238
Royal Statistical Society (2002). "News release: RSS concerned by issues raised in Sally Clark case." https://rss.org.uk/news-publication/news-publications/2002/general-news/royal-statistical-society-concerned-by-issues-rai/
Hill, R. (2004). "Multiple Sudden Infant Deaths: Coincidence or Beyond Coincidence?" *Paediatric and Perinatal Epidemiology*, 18(5), 320-326. https://onlinelibrary.wiley.com/doi/10.1111/j.1365-3016.2004.00560.x
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., and Wagenmakers, E.-J. (2016). "The Fallacy of Placing Confidence in Confidence Intervals." *Psychonomic Bulletin and Review*, 23(1), 103-123. https://link.springer.com/article/10.3758/s13423-015-0947-8
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). "Variational Inference: A Review for Statisticians." *Journal of the American Statistical Association*, 112(518), 859-877. https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773
Efron, B. (2005). "Bayesians, Frequentists, and Scientists." *Journal of the American Statistical Association*, 100(469), 1-5. https://www.tandfonline.com/doi/abs/10.1198/016214505000000033
Rasmussen, C. E. and Williams, C. K. I. (2006). *Gaussian Processes for Machine Learning*. MIT Press. https://gaussianprocess.org/gpml/
Stone, L. D., Keller, C. M., Kratzke, T. M., and Strumpfer, J. P. (2014). "Search for the Wreckage of Air France Flight AF 447." *Statistical Science*, 29(1), 69-80. https://projecteuclid.org/journals/statistical-science/volume-29/issue-1/10.1214/13-STS420.full
Efron, B. (2010). *Large-Scale Inference: Empirical Bayes Methods*. Cambridge University Press. https://www.cambridge.org/core/books/largescale-inference/A0B183B0080A92966497F12CE5D12589
