Bayesian statistics
Last reviewed
May 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,968 words
Bayesian statistics is a branch of statistics in which probability is interpreted as a degree of belief, and statistical inference is performed by updating those beliefs in light of new evidence using Bayes' theorem. It contrasts with frequentist statistics, where probability refers to the long-run frequency of an event in repeated experiments and parameters are treated as fixed but unknown constants. In the Bayesian framework, parameters are themselves random variables with probability distributions, and the central object of inference is the posterior distribution over those parameters given the observed data.
The approach is named for the Reverend Thomas Bayes, whose 1763 essay laid the mathematical foundations, but it took roughly two centuries, several philosophical revivals, and a computational revolution before Bayesian methods became practical at the scale modern science demands. Today the framework is used in clinical trial design, machine learning, cosmology, election forecasting, and large parts of cognitive science, and it underlies a thriving ecosystem of probabilistic programming languages such as Stan, PyMC, and NumPyro.
All of Bayesian statistics rests on a single rearrangement of the definition of conditional probability. Given a hypothesis or parameter $H$ and observed data $D$:
$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$
A more compact and frequently quoted form is:
$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$
The four ingredients each have a precise role:
| Term | Symbol | Meaning |
|---|---|---|
| Posterior | $P(H \mid D)$ | Updated probability of the hypothesis after seeing the data |
| Likelihood | $P(D \mid H)$ | Probability of the observed data assuming the hypothesis is true |
| Prior belief | $P(H)$ | Probability assigned to the hypothesis before seeing the data |
| Marginal likelihood / evidence | $P(D) = \int P(D \mid H)\,P(H)\,dH$ | Normalising constant that makes the posterior integrate to one; also used for model comparison |
In most realistic problems the integral defining $P(D)$ has no closed form, so a great deal of Bayesian computation is devoted to evaluating that integral, sidestepping it, or sampling directly from the unnormalised posterior $P(D \mid H)\,P(H)$.
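When the hypotheses are discrete, the integral for $P(D)$ becomes a sum and the whole update fits in a few lines. The sketch below is a minimal illustration, not a library API: the function name `bayes_update` and the diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are assumptions chosen for the example.

```python
def bayes_update(prior, likelihood):
    """Return (posterior, evidence) over a finite set of hypotheses.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(D | H) for the observed data D
    """
    # Unnormalised posterior: P(D | H) * P(H) for every hypothesis.
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    # Evidence P(D): a sum over hypotheses, the discrete analogue of the integral.
    evidence = sum(unnormalised.values())
    posterior = {h: value / evidence for h, value in unnormalised.items()}
    return posterior, evidence


# Illustrative numbers only (assumptions, not taken from any study).
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_positive = {"disease": 0.95, "healthy": 0.05}  # P(positive test | H)

posterior, evidence = bayes_update(prior, likelihood_positive)
print(posterior, evidence)
```

With these numbers the posterior probability of disease after one positive test is about 0.16: the likelihood strongly favours disease, but the low prior keeps the posterior modest.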
Bayesian statistics has its own vocabulary, much of it shared with Bayesian inference and probabilistic modelling more broadly.
| Concept | Description |
|---|---|
| Prior | Distribution over parameters that encodes belief before seeing the data. May be informative (encoding domain knowledge), weakly informative (mild regularisation that still lets the data dominate), or uninformative (intended to influence the posterior as little as possible). |
| Likelihood | Probability model linking parameters to data; the same object used in maximum likelihood estimation, but interpreted as a function of the parameters given fixed data. |
| Posterior | Distribution over parameters after conditioning on the observed data. Summaries include the posterior mean, median, mode (the maximum a posteriori or MAP estimate), and quantiles. |
| Marginal likelihood | $P(D)$, also called the model evidence. Used in Bayes factors and Bayesian model comparison. |
| Posterior predictive distribution | $P(D_{\text{new}} \mid D) = \int P(D_{\text{new}} \mid \theta)\,P(\theta \mid D)\,d\theta$, the predictive distribution for new data after marginalising over parameter uncertainty. |
| Hyperparameter | A parameter of the prior distribution itself; treating hyperparameters as random variables with their own priors gives rise to hierarchical models. |
| Conjugate prior | Prior chosen so that the posterior belongs to the same parametric family. Conjugacy yields closed-form posteriors and was historically the main way Bayesian problems were made tractable. |
| Credible interval | Interval that contains a specified amount of posterior probability (for example 95%). The Bayesian counterpart to a frequentist confidence interval, but with a direct probabilistic interpretation: there is a 95% probability the parameter lies in the interval given the data and prior. |
| Highest posterior density (HPD) region | The smallest region containing a given posterior probability mass; useful for skewed posteriors and for multimodal ones, where it may consist of several disjoint intervals. |
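
Several of the summaries in the table above can be computed with a simple grid approximation. The sketch below assumes a hypothetical data set of 7 successes and 3 failures with a uniform prior on the success probability; the helper names `grid_posterior` and `equal_tailed_interval` are illustrative, and the interval shown is the equal-tailed credible interval rather than the HPD region.

```python
def grid_posterior(successes, failures, n_grid=10_001):
    """Grid approximation to the posterior of a success probability theta
    under a uniform prior and a binomial likelihood."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    # Unnormalised posterior = likelihood * prior; the uniform prior is constant.
    weights = [t**successes * (1 - t) ** failures for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def equal_tailed_interval(thetas, probs, mass=0.95):
    """Interval between the (1 - mass)/2 and (1 + mass)/2 posterior quantiles."""
    lower_q, upper_q = (1 - mass) / 2, (1 + mass) / 2
    cumulative, lower, upper = 0.0, None, None
    for t, p in zip(thetas, probs):
        cumulative += p
        if lower is None and cumulative >= lower_q:
            lower = t
        if cumulative >= upper_q:
            upper = t
            break
    return lower, upper


thetas, probs = grid_posterior(successes=7, failures=3)

posterior_mean = sum(t * p for t, p in zip(thetas, probs))
map_estimate = max(zip(probs, thetas))[1]          # grid point with the highest mass
interval = equal_tailed_interval(thetas, probs)    # 95% credible interval
# For a single new Bernoulli trial, the posterior predictive probability of
# success is the integral of theta against the posterior: the posterior mean.
predictive_success = posterior_mean

print(posterior_mean, map_estimate, interval, predictive_success)
```

The posterior here is Beta(8, 4), so the grid results can be checked against the known mean 8/12 and mode 0.7.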
The history of Bayesian statistics is unusually contested for a mathematical subject; the framework spent more than a century in the wilderness before re-emerging as a dominant approach.
The Reverend Thomas Bayes, an English Presbyterian minister, formulated his theorem some time in the late 1740s. He never published it. After his death in 1761 his friend Richard Price edited the manuscript and read it before the Royal Society in December 1763 as "An Essay towards solving a Problem in the Doctrine of Chances". The essay tackled the inverse probability problem of inferring an unknown chance from observed successes, and it included Price's own introduction explaining the philosophical motivation.
Pierre-Simon Laplace developed the same ideas independently and far more systematically. His 1812 "Théorie analytique des probabilités" made what we now call Bayesian inference a workhorse of nineteenth-century science, applied to problems in astronomy, demography, and the reliability of legal testimony. For Laplace, probability was the natural language of partial knowledge, and prior probabilities were assigned by the principle of insufficient reason: with no grounds to prefer one hypothesis over another, treat them as equally probable.
The twentieth century saw a sharp turn against this view. Ronald Fisher, Jerzy Neyman, and Egon Pearson built the frequentist edifice between roughly 1920 and 1940: maximum likelihood, significance tests, p-values, confidence intervals, and the Neyman-Pearson framework for hypothesis testing. Their work pushed Bayesian methods to the margins of academic statistics, in part because the priors looked subjective and arbitrary, and in part because the resulting integrals were generally impossible to compute.
A quiet revival began in the 1930s and 1940s. Bruno de Finetti's 1937 paper "La prévision: ses lois logiques, ses sources subjectives" gave subjective probability a rigorous behavioural foundation, deriving the rules of probability from coherent betting odds and proving the famous representation theorem for exchangeable sequences. Leonard Savage's 1954 "The Foundations of Statistics" combined de Finetti's subjective probability with the von Neumann-Morgenstern theory of utility to produce the canonical axiomatisation of Bayesian decision theory. Edwin Jaynes argued through the 1950s and 1960s that probability theory is simply extended logic, a view eventually compiled in his posthumous 2003 book "Probability Theory: The Logic of Science".
The practical breakthrough was computational. The Metropolis algorithm appeared in a 1953 physics paper by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, and was generalised to non-symmetric proposals by W. K. Hastings in 1970. Stuart and Donald Geman introduced the Gibbs sampler in 1984 for image restoration. Once Alan Gelfand and Adrian Smith showed in 1990 how Gibbs sampling could be applied to mainstream Bayesian problems, Markov Chain Monte Carlo (MCMC) became the engine that let Bayesian inference handle realistic models. The 2014 paper by Matthew Hoffman and Andrew Gelman introduced the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that powers modern Stan and freed users from manually tuning the number of leapfrog steps.
The last fifteen years brought probabilistic programming. Stan, PyMC, Pyro, NumPyro, and TensorFlow Probability turned model specification into something close to scripting, and they made Bayesian inference accessible outside the small community of statisticians willing to write their own samplers.
The two paradigms answer different questions and use the same data in different ways. The differences are real but often less stark in practice than the philosophical debates suggest.
| Issue | Frequentist | Bayesian |
|---|---|---|
| Probability interpretation | Long-run frequency in repeated trials | Degree of belief; a measure of uncertainty |
| Parameters | Fixed unknown constants | Random variables with distributions |
| Inference target | Estimators with good sampling properties | Posterior distributions over parameters |
| Interval estimate | Confidence interval (about the procedure) | Credible interval (about the parameter, given data) |
| Hypothesis testing | Null hypothesis significance testing, p-values | Posterior probabilities, Bayes factors, model comparison |
| Prior information | Not formally used in pure frequentist analysis | Encoded explicitly as a prior distribution |
| Computational cost | Usually low; closed-form or optimisation | Usually high; requires sampling or approximation of the posterior |
| Model checking | Goodness-of-fit tests, residual analysis | Posterior predictive checks, cross-validation |
| Sequential updating | Awkward; multiple testing corrections needed | Natural; today's posterior is tomorrow's prior |
A classic example: a frequentist 95% confidence interval is constructed so that, across many hypothetical repetitions of the experiment, 95% of intervals computed this way will contain the true parameter. A 95% Bayesian credible interval is one in which, given this particular data set and prior, the posterior probability that the parameter lies in the interval is 95%. The Bayesian statement is the one most people think a confidence interval means.
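The repeated-sampling reading of a confidence interval can be checked by simulation. The sketch below assumes normal data with a known standard deviation and uses illustrative values for the true mean, sample size, and number of repetitions; the function names are hypothetical. In this particular setting a flat prior on the mean gives a 95% credible interval with the same endpoints, so the two paradigms differ only in how the interval is interpreted.

```python
import random
import statistics


def interval_for_mean(sample, sigma, z=1.96):
    """95% confidence interval for a normal mean with known sigma.

    Under a flat prior on the mean, the central 95% credible interval has
    the same endpoints; only the interpretation differs.
    """
    centre = statistics.fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return centre - half_width, centre + half_width


def coverage(true_mean=2.0, sigma=1.0, n=25, repetitions=4000, seed=0):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = interval_for_mean(sample, sigma)
        hits += low <= true_mean <= high
    return hits / repetitions


estimated_coverage = coverage()
print(estimated_coverage)  # should be close to 0.95
```

The coverage figure is a property of the procedure across repetitions; any single interval either contains the true mean or does not.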
Most applied Bayesian work uses a small set of recurring building blocks. Many of the simplest models exploit conjugacy, where prior and likelihood are mathematically matched so the posterior has a known form.
| Model | Likelihood | Prior | Notes |
|---|---|---|---|
| Beta-Binomial | Binomial | Beta | Canonical coin-flip example; posterior is Beta with updated parameters. |
| Normal-Normal | Normal with known variance | Normal | Posterior mean is a precision-weighted average of prior mean and sample mean. |
| Normal-Inverse-Gamma | Normal with unknown variance | Normal-Inverse-Gamma | Standard joint conjugate for mean and variance of a Gaussian. |
| Dirichlet-Multinomial | Multinomial | Dirichlet | Used for categorical data, language models, topic models. |
| Poisson-Gamma | Poisson | Gamma | Counts data; rate posterior is Gamma. |
| Bayesian linear regression | Gaussian likelihood on residuals | Gaussian or hierarchical priors on coefficients | Gives full posterior over coefficients and predictive intervals that account for parameter uncertainty. |
| Bayesian generalised linear models | GLM (logistic, Poisson, etc.) | Weakly informative priors | The bread and butter of applied Bayesian regression. |
| Hierarchical / multilevel models | Group-level likelihoods | Group-level parameters drawn from a population prior | Often called the killer app of Bayesian methods; partial pooling shrinks group estimates toward the population mean. |
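| Gaussian processes | Gaussian noise for regression; a non-Gaussian observation model for classification | GP prior over functions, with the kernel controlling smoothness and scale | Nonparametric regression and classification with calibrated uncertainty. |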
| Mixture models | Weighted sum of component distributions | Dirichlet weights, component-wise priors | Used for density estimation and clustering; Dirichlet process mixtures extend to infinite components. |
| Latent Dirichlet Allocation (LDA) | Multinomial over words | Dirichlet over topics | A pure Bayesian topic model introduced by Blei, Ng, and Jordan in 2003. |
| Hidden Markov models with Bayesian priors | State transition and emission distributions | Dirichlet on transitions, conjugate priors on emissions | Used in speech recognition, biology, finance. |
| Bayesian neural networks | Neural network likelihood | Distribution over weights | Replace point-estimate weights with distributions to get predictive uncertainty. |
| Bayesian networks | Factorised joint over a directed graph | Conditional probability tables, often with Dirichlet priors | Graphical model framework that includes Naive Bayes as a special case. |
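
The first two rows of the table above have posteriors that can be written down directly. The sketch below shows both updates; the function names and the example numbers are illustrative assumptions, and the Normal-Normal case assumes the observation variance is known.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


def normal_normal_update(prior_mean, prior_var, data, noise_var):
    """Normal prior on the mean + normal data with known variance.

    The posterior mean is a precision-weighted average of the prior mean
    and the sample mean; precisions (inverse variances) add.
    """
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var


# Uniform Beta(1, 1) prior, then 7 successes and 3 failures.
batch = beta_binomial_update(1, 1, successes=7, failures=3)

# Same data in two instalments: the first posterior becomes the next prior.
step_one = beta_binomial_update(1, 1, successes=3, failures=1)
sequential = beta_binomial_update(*step_one, successes=4, failures=2)

post_mean, post_var = normal_normal_update(
    prior_mean=0.0, prior_var=1.0, data=[2.0, 2.0, 2.0, 2.0], noise_var=1.0
)
print(batch, sequential, post_mean, post_var)
```

Batch and sequential updating give the same Beta(8, 4) posterior. In the normal example the posterior mean is about 1.6 and the posterior variance about 0.2: four observations at 2.0 carry four times the precision of the prior centred at 0, so the estimate is pulled most of the way toward the data.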
The computational story is the reason Bayesian methods went from a niche philosophical position in the 1970s to a default tool in the 2020s. There is no single way to compute a posterior; the choice of method depends on the model's size and structure and on how much accuracy is needed.
| Method | What it does | Strengths | Limitations |
|---|---|---|---|
| Conjugate priors | Closed-form posterior using algebraic conjugacy | Exact; instantaneous | Restricted to a small set of model families |
| Laplace approximation | Gaussian fit at the posterior mode | Fast; easy to implement | Poor for skewed or multimodal posteriors |
| Variational inference (VI) | Approximate posterior with a tractable family by maximising the evidence lower bound (ELBO) | Scales to large data; deterministic | Tends to underestimate variance; quality depends on the variational family |
| Mean-field VI | VI with independent factors per parameter | Especially fast | Ignores posterior correlations |
| Stochastic / black-box VI | VI using Monte Carlo gradients of the ELBO | Works with arbitrary models | Variance of gradient estimators can be high |
| Metropolis-Hastings (1953/1970) | Generic random-walk MCMC with accept/reject | Very general | Can mix slowly in high dimensions |
| Gibbs sampling (Geman & Geman 1984) | MCMC sampling from full conditionals | Easy when conditionals are tractable | Mixes slowly under strong correlations |
| Hamiltonian Monte Carlo / NUTS (Hoffman & Gelman 2014) | MCMC using gradient information from the log posterior | State of the art for continuous parameters; powers Stan | Requires differentiable models; expensive per step |
| Sequential Monte Carlo / particle filters | Recursively reweight and resample particles to track an evolving posterior | Natural for state-space models and online updates | Particle degeneracy in high dimensions |
| Importance sampling | Reweight samples from a proposal to estimate posterior expectations | Embarrassingly parallel | Fails badly when proposal is far from the posterior |
| Annealed importance sampling | Importance sampling along a temperature schedule | Useful for marginal likelihoods | Schedule design is delicate |
| INLA (Integrated Nested Laplace Approximation) | Combine Laplace approximations with numerical integration for latent Gaussian models | Very fast for the right model class | Restricted to latent Gaussian structure |
MCMC convergence diagnostics are a routine part of Bayesian workflow. Effective sample size, the Gelman-Rubin $\hat R$ statistic, and trace plots check whether the chains have converged and mixed, while posterior predictive checks assess whether the fitted model reproduces the observed data. Good Bayesian software reports the convergence diagnostics by default.
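To make the sampling and diagnostic steps concrete, the sketch below runs a random-walk Metropolis sampler on a one-parameter posterior and computes the classic (non-split) Gelman-Rubin statistic from four chains. It is a teaching sketch under stated assumptions, not how Stan or PyMC work internally: the target (7 successes, 3 failures, uniform prior), the step size, the chain length, the burn-in, and all function names are illustrative choices.

```python
import math
import random
import statistics


def log_posterior(theta, successes=7, failures=3):
    """Unnormalised log posterior: binomial likelihood, uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(log_density, start, n_draws, step, rng):
    """Random-walk Metropolis with symmetric Gaussian proposals."""
    current, current_lp = start, log_density(start)
    draws = []
    for _ in range(n_draws):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws


def gelman_rubin(chains):
    """Classic (non-split) R-hat for equal-length chains of a scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(1)
starts = [0.1, 0.3, 0.6, 0.9]   # overdispersed starting points
burn_in = 1000                  # early draws discarded before summarising
chains = [metropolis(log_posterior, s, 5000, 0.2, rng)[burn_in:] for s in starts]

posterior_mean = statistics.fmean([x for c in chains for x in c])
r_hat = gelman_rubin(chains)
print(posterior_mean, r_hat)
```

The exact posterior is Beta(8, 4) with mean 2/3, so the sampled mean should land close to that value, and an $\hat R$ near 1 indicates that the four chains agree. Only the unnormalised log posterior is needed: the marginal likelihood cancels in the acceptance ratio.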
Probabilistic programming languages let users describe a generative model in code and leave the inference engine to handle posterior computation.
| Software | Language / backend | Notes |
|---|---|---|
| Stan (Carpenter et al. 2017) | C++ engine with R, Python, Julia, command-line, and other interfaces | NUTS is the default sampler; widely used in academia, industry, and pharma. |
| PyMC (Salvatier, Wiecki & Fonnesbeck 2016) | Python; originally Theano, now PyTensor and JAX | Pythonic API; supports NUTS, variational inference, and JAX-backed samplers. |
| Pyro (Bingham et al. 2018) | Python on PyTorch | Built at Uber for deep generative models and stochastic variational inference. |
| NumPyro (Phan, Pradhan & Jankowiak 2019) | Python on JAX | Lightweight rewrite of Pyro with very fast NUTS and SVI on accelerators. |
| Edward / TensorFlow Probability | Python on TensorFlow | TFP is the actively maintained successor to Edward, used inside Google. |
| BUGS / WinBUGS / OpenBUGS | Custom DSL | The original Bayesian DSL from the 1990s; mostly historical now. |
| JAGS | C++ implementation of the BUGS language | Still common in ecology and social science. |
| Turing.jl | Julia | Composable inference algorithms in Julia. |
Most of these tools share a common pattern: write the model once in a high-level language, obtain gradients through automatic differentiation, and run any of several inference algorithms without rewriting the model. That separation between model and inference is what makes probabilistic programming practical.
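The separation between model and inference can be illustrated without any library. In the toy sketch below the model is a single log-density function, and two interchangeable routines consume it: grid integration and a Laplace approximation. Nothing here is the API of Stan, PyMC, or any other package; the names are hypothetical, the data (7 successes, 3 failures) are illustrative, and finite differences stand in for the automatic differentiation that real systems use.

```python
import math


def log_posterior(theta, successes=7, failures=3):
    """Model: binomial likelihood with a uniform prior on theta (unnormalised)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def make_grid(low, high, n_grid):
    width = (high - low) / (n_grid - 1)
    return [low + i * width for i in range(n_grid)]


def infer_by_grid(log_density, low, high, n_grid=20_001):
    """Inference routine 1: normalise on a grid; return posterior mean and variance."""
    points = make_grid(low, high, n_grid)
    weights = [math.exp(log_density(x)) for x in points]
    total = sum(weights)
    mean = sum(x * w for x, w in zip(points, weights)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(points, weights)) / total
    return mean, var


def infer_by_laplace(log_density, low, high, n_grid=20_001, h=1e-4):
    """Inference routine 2: Gaussian centred at the mode, variance from curvature."""
    mode = max(make_grid(low, high, n_grid), key=log_density)
    # Central finite difference for the second derivative of the log density.
    curvature = (
        log_density(mode + h) - 2.0 * log_density(mode) + log_density(mode - h)
    ) / h**2
    return mode, -1.0 / curvature


# The same model function is passed unchanged to both routines.
grid_mean, grid_var = infer_by_grid(log_posterior, 0.0, 1.0)
laplace_mean, laplace_var = infer_by_laplace(log_posterior, 0.0, 1.0)
print(grid_mean, grid_var, laplace_mean, laplace_var)
```

The grid result matches the exact Beta(8, 4) posterior (mean about 0.667, variance about 0.017), while the Laplace approximation centres on the mode 0.7 with variance about 0.021. The gap between the two is the kind of error a Gaussian fit makes on a skewed posterior.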
Many machine learning techniques have a Bayesian reading, and several were originally derived from a Bayesian starting point.
| Topic | Bayesian connection |
|---|---|
| L2 regularisation | Equivalent to maximum a posteriori estimation under a Gaussian prior on the weights. |
| L1 regularisation (Lasso) | Equivalent to MAP estimation under a Laplace prior, which encourages sparsity. |
| Dropout | Yarin Gal's 2016 thesis showed that dropout in deep networks can be interpreted as approximate Bayesian inference; Monte Carlo dropout uses this to estimate predictive uncertainty. |
| SWAG | Maddox et al. (2019) fit a Gaussian over the trajectory of stochastic gradient descent iterates, giving a cheap approximate posterior over neural network weights. |
| Bayesian model selection | Bayes factors compare models by their marginal likelihood; the BIC is an asymptotic approximation. |
| Bayesian optimisation | Uses a Gaussian process surrogate for an expensive black-box function, and a Bayesian acquisition function to choose the next evaluation point. Heavily used for hyperparameter tuning. |
| Topic models | LDA and its extensions are pure Bayesian models with Dirichlet priors. |
| Bayesian reinforcement learning | Maintain a posterior over MDP dynamics or value functions; Thompson sampling is a Bayesian bandit strategy. |
| Calibration | Posterior predictive distributions average over parameter uncertainty and tend to be well calibrated when the model is well specified; non-Bayesian deep nets typically need post-hoc calibration. |
| Uncertainty quantification | Bayesian models give a principled separation of aleatoric and epistemic uncertainty. |
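
The L2 row of the table above can be verified in the simplest possible case: a single slope with no intercept. The sketch below assumes Gaussian noise with variance `noise_var` and a zero-mean Gaussian prior with variance `prior_var` on the slope; setting the derivative of the log posterior to zero gives the same estimate as ridge regression with penalty `noise_var / prior_var`. The function names and data are illustrative.

```python
def ridge_slope(xs, ys, penalty):
    """Minimiser of sum((y - w*x)**2) + penalty * w**2 over a single slope w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def map_slope(xs, ys, noise_var, prior_var):
    """MAP estimate of w for y ~ Normal(w*x, noise_var), w ~ Normal(0, prior_var).

    The log posterior is, up to a constant,
        -sum((y - w*x)**2) / (2*noise_var) - w**2 / (2*prior_var),
    and setting its derivative with respect to w to zero gives the line below.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy / noise_var) / (sxx / noise_var + 1.0 / prior_var)


# Illustrative data (assumed, not from any real study).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
noise_var, prior_var = 0.5, 2.0

w_map = map_slope(xs, ys, noise_var, prior_var)
w_ridge = ridge_slope(xs, ys, penalty=noise_var / prior_var)
w_least_squares = ridge_slope(xs, ys, penalty=0.0)
print(w_map, w_ridge, w_least_squares)
```

A tighter prior (smaller `prior_var`) corresponds to a larger penalty and stronger shrinkage toward zero; as `prior_var` grows the MAP estimate approaches ordinary least squares.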
Bayesian statistics is no longer a niche tool. It is mainstream across science, industry, and policy.
Bayesian methods have a number of properties that make them attractive in science and industry alike.
The approach also has real costs.
Bayesian methods have moved from the periphery to the centre of several fields. In cognitive science the Bayesian brain hypothesis treats perception and learning as approximate Bayesian inference. In industry, hierarchical Bayesian models are standard for marketing analytics, recommendation systems, and reliability engineering. In machine learning, Bayesian deep learning is an active research area motivated by uncertainty quantification, model calibration, and AI safety. In medical statistics the FDA has accepted Bayesian designs for confirmatory trials in selected settings, and adaptive Bayesian designs are routine in oncology platform trials. In the debate around the reproducibility crisis, prominent statisticians have argued that Bayesian model comparison and posterior reporting should partly replace null hypothesis significance testing.
None of this means the frequentist tradition has been displaced. Most introductory statistics courses still teach the Neyman-Pearson framework, and most published p-values come from frequentist procedures. What has changed is that Bayesian methods are now a serious alternative rather than a fringe view, with mature software and large user communities.