Bayesian statistics

Bayesian statistics is a branch of statistics in which probability is interpreted as a degree of belief, and statistical inference is performed by updating those beliefs in light of new evidence using Bayes' theorem. It contrasts with frequentist statistics, where probability refers to the long-run frequency of an event in repeated experiments and parameters are treated as fixed but unknown constants. In the Bayesian framework, parameters are themselves random variables with probability distributions, and the central object of inference is the posterior distribution over those parameters given the observed data.

The approach is named for the Reverend Thomas Bayes, whose 1763 essay laid the mathematical foundations, but it took roughly two centuries, several philosophical revivals, and a computational revolution before Bayesian methods became practical at the scale modern science demands. Today the framework is used in clinical trial design, machine learning, cosmology, election forecasting, and large parts of cognitive science, and it underlies a thriving ecosystem of probabilistic programming languages such as Stan, PyMC, and NumPyro.

The fundamental equation

All of Bayesian statistics rests on a single rearrangement of the definition of conditional probability. Given a hypothesis or parameter $H$ and observed data $D$:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

A more compact and frequently quoted form is:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

The four ingredients each have a precise role:

Term	Symbol	Meaning
Posterior	$P(H \mid D)$	Updated probability of the hypothesis after seeing the data
Likelihood	$P(D \mid H)$	Probability of the observed data assuming the hypothesis is true
Prior belief	$P(H)$	Probability assigned to the hypothesis before seeing the data
Marginal likelihood / evidence	$P(D) = \int P(D \mid H)\,P(H)\,dH$	Normalising constant that makes the posterior integrate to one; also used for model comparison

In most realistic problems the integral defining $P(D)$ has no closed form, so a great deal of Bayesian computation is devoted either to evaluating that integral, sidestepping it, or sampling from the unnormalised posterior $P(D \mid H)\,P(H)$ directly.
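When the hypotheses form a small discrete set, the integral for $P(D)$ reduces to a sum and the theorem can be applied directly. The following is a minimal sketch in Python for a diagnostic test; the prevalence, sensitivity, and specificity are hypothetical numbers chosen for illustration, and the function name `posterior` is not from any library.

```python
def posterior(prior, likelihood):
    """Return P(H | D) for each hypothesis H, given P(H) and P(D | H)."""
    # Unnormalised posterior: likelihood times prior for each hypothesis.
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    # Evidence P(D): a sum over hypotheses, the discrete analogue of the integral.
    evidence = sum(unnormalised.values())
    return {h: value / evidence for h, value in unnormalised.items()}


# Hypothetical inputs: 1% prevalence, 95% sensitivity, 90% specificity.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_positive = {"disease": 0.95, "healthy": 0.10}  # P(positive test | H)

post = posterior(prior, likelihood_positive)
print(post)
```

With these assumed numbers the posterior probability of disease after a positive test is 0.0095 / 0.1085, a little under 9%, because the low prior outweighs the fairly accurate test.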

Core concepts

Bayesian statistics has its own vocabulary, much of it shared with Bayesian inference and probabilistic modelling more broadly.

Concept	Description
Prior	Distribution over parameters that encodes belief before seeing the data. May be informative (encoding domain knowledge) or uninformative or weakly informative (intended to let the data dominate).
Likelihood	Probability model linking parameters to data; the same object used in maximum likelihood estimation, but interpreted as a function of the parameters given fixed data.
Posterior	Distribution over parameters after conditioning on the observed data. Summaries include the posterior mean, median, mode (the maximum a posteriori or MAP estimate), and quantiles.
Marginal likelihood	$P(D)$, also called the model evidence. Used in Bayes factors and Bayesian model comparison.
Posterior predictive distribution	$P(D_{\text{new}} \mid D) = \int P(D_{\text{new}} \mid \theta)\,P(\theta \mid D)\,d\theta$, the predictive distribution for new data after marginalising over parameter uncertainty.
Hyperparameter	A parameter of the prior distribution itself; treating hyperparameters as random variables with their own priors gives rise to hierarchical models.
Conjugate prior	Prior chosen so that the posterior belongs to the same parametric family. Conjugacy yields closed-form posteriors and was historically the main way Bayesian problems were made tractable.
Credible interval	Interval that contains a specified amount of posterior probability (for example 95%). The Bayesian counterpart to a frequentist confidence interval, but with a direct probabilistic interpretation: there is a 95% probability the parameter lies in the interval given the data and prior.
Highest posterior density (HPD) region	The smallest region containing a given posterior probability mass; a common way of summarising multimodal posteriors.
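Several of these concepts can be seen together in the simplest conjugate case, a Beta prior on a success probability with Binomial data. The sketch below is illustrative only: the Beta(2, 2) prior and the counts are assumptions, and the helper names are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior becomes a Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    # The mode (MAP estimate) has this form only when alpha > 1 and beta > 1.
    return (alpha - 1) / (alpha + beta - 2)


alpha0, beta0 = 2.0, 2.0      # assumed weakly informative prior
successes, failures = 7, 3    # assumed observed data

alpha_n, beta_n = beta_binomial_update(alpha0, beta0, successes, failures)
posterior_mean = beta_mean(alpha_n, beta_n)
map_estimate = beta_mode(alpha_n, beta_n)
# For a single future Bernoulli trial, the posterior predictive probability of
# success is the integral of theta against the posterior, i.e. the posterior mean.
predictive_next_success = posterior_mean

print(alpha_n, beta_n, posterior_mean, map_estimate)
```

Here the prior hyperparameters act as pseudo-counts: the posterior is Beta(9, 5), so the posterior mean 9/14 sits between the prior mean 1/2 and the observed proportion 7/10.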

History

The history of Bayesian statistics is unusually contested for a mathematical subject; the framework spent more than a century in the wilderness before re-emerging as a dominant approach.

The Reverend Thomas Bayes, an English Presbyterian minister, formulated his theorem some time in the late 1740s. He never published it. After his death in 1761 his friend Richard Price edited the manuscript and read it before the Royal Society in December 1763 as "An Essay towards solving a Problem in the Doctrine of Chances". The essay tackled the inverse probability problem of inferring an unknown chance from observed successes, and it included Price's own introduction explaining the philosophical motivation.

Pierre-Simon Laplace developed the same ideas independently and far more systematically. His 1812 "Théorie analytique des probabilités" made what we now call Bayesian inference a workhorse of nineteenth century science, applied to problems in astronomy, demography, and the reliability of legal testimony. For Laplace, probability was the natural language of partial knowledge, and in the absence of other information prior probabilities could be assigned by the principle of insufficient reason, which treats all possibilities as equally likely.

The twentieth century saw a sharp turn against this view. Ronald Fisher, Jerzy Neyman, and Egon Pearson built the frequentist edifice between roughly 1920 and 1940: maximum likelihood, significance tests, p-values, confidence intervals, and the Neyman-Pearson framework for hypothesis testing. Their work pushed Bayesian methods to the margins of academic statistics, in part because the priors looked subjective and arbitrary, and in part because the resulting integrals were generally impossible to compute.

A quiet revival began in the 1930s and 1940s. Bruno de Finetti's 1937 paper "La prévision: ses lois logiques, ses sources subjectives" gave subjective probability a rigorous behavioural foundation, deriving the rules of probability from coherent betting odds and proving the famous representation theorem for exchangeable sequences. Leonard Savage's 1954 "The Foundations of Statistics" combined de Finetti's subjective probability with the von Neumann-Morgenstern theory of utility to produce the canonical axiomatisation of Bayesian decision theory. Edwin Jaynes argued through the 1950s and 1960s that probability theory is simply extended logic, a view eventually compiled in his posthumous 2003 book "Probability Theory: The Logic of Science".

The practical breakthrough was computational. The Metropolis algorithm appeared in a 1953 physics paper by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, and was generalised to non-symmetric proposals by W. K. Hastings in 1970. Stuart and Donald Geman introduced the Gibbs sampler in 1984 for image restoration. Once Alan Gelfand and Adrian Smith showed in 1990 how Gibbs sampling could be applied to mainstream Bayesian problems, Markov Chain Monte Carlo (MCMC) became the engine that let Bayesian inference handle realistic models. The 2014 paper by Matthew Hoffman and Andrew Gelman introduced the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that powers modern Stan and freed users from manually tuning the number of leapfrog steps.

The last fifteen years brought probabilistic programming. Stan, PyMC, Pyro, NumPyro, and TensorFlow Probability turned model specification into something close to scripting, and they made Bayesian inference accessible outside the small community of statisticians willing to write their own samplers.

Frequentist versus Bayesian

The two paradigms answer different questions and use the same data in different ways. The differences are real but often less stark in practice than the philosophical debates suggest.

Issue	Frequentist	Bayesian
Probability interpretation	Long-run frequency in repeated trials	Degree of belief; a measure of uncertainty
Parameters	Fixed unknown constants	Random variables with distributions
Inference target	Estimators with good sampling properties	Posterior distributions over parameters
Interval estimate	Confidence interval (about the procedure)	Credible interval (about the parameter, given data)
Hypothesis testing	Null hypothesis significance testing, p-values	Posterior probabilities, Bayes factors, model comparison
Prior information	Not formally used in pure frequentist analysis	Encoded explicitly as a prior distribution
Computational cost	Usually low; closed-form or optimisation	Usually high; requires sampling or approximation of the posterior
Model checking	Goodness-of-fit tests, residual analysis	Posterior predictive checks, cross-validation
Sequential updating	Awkward; multiple testing corrections needed	Natural; today's posterior is tomorrow's prior

A classic example: a frequentist 95% confidence interval is constructed so that, across many hypothetical repetitions of the experiment, 95% of intervals computed this way will contain the true parameter. A 95% Bayesian credible interval is one in which, given this particular data set and prior, the posterior probability that the parameter lies in the interval is 95%. The Bayesian statement is the one most people think a confidence interval means.
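The two kinds of interval can be computed side by side for the same data. The sketch below assumes 14 successes in 40 trials, uses the simple Wald (normal approximation) interval as the frequentist example, and obtains the credible interval from Monte Carlo draws of a Beta posterior under a flat Beta(1, 1) prior. All of these are illustrative choices, and the function names are hypothetical.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald method)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def beta_credible_interval(alpha, beta, mass=0.95, draws=100_000, seed=0):
    """Equal-tailed credible interval estimated from sorted posterior draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lower_index = int((1 - mass) / 2 * draws)
    upper_index = int((1 + mass) / 2 * draws) - 1
    return samples[lower_index], samples[upper_index]


successes, n = 14, 40  # assumed data
ci = wald_interval(successes, n)
# A flat Beta(1, 1) prior gives a Beta(1 + successes, 1 + failures) posterior.
cri = beta_credible_interval(1 + successes, 1 + n - successes)

print("confidence interval:", ci)
print("credible interval:  ", cri)
```

The numbers are often similar, as they are here, but the statements attached to them differ: the first describes the long-run behaviour of the procedure, the second describes the parameter given this data set and this prior.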

Common Bayesian models

Most applied Bayesian work uses a small set of recurring building blocks. Many of the simplest models exploit conjugacy, where prior and likelihood are mathematically matched so the posterior has a known form.

Model	Likelihood	Prior	Notes
Beta-Binomial	Binomial	Beta	Canonical coin-flip example; posterior is Beta with updated parameters.
Normal-Normal	Normal with known variance	Normal	Posterior mean is a precision-weighted average of prior mean and sample mean.
Normal-Inverse-Gamma	Normal with unknown variance	Normal-Inverse-Gamma	Standard joint conjugate for mean and variance of a Gaussian.
Dirichlet-Multinomial	Multinomial	Dirichlet	Used for categorical data, language models, topic models.
Poisson-Gamma	Poisson	Gamma	Counts data; rate posterior is Gamma.
Bayesian linear regression	Gaussian likelihood on residuals	Gaussian or hierarchical priors on coefficients	Gives full posterior over coefficients and predictive intervals that account for parameter uncertainty.
Bayesian generalised linear models	GLM (logistic, Poisson, etc.)	Weakly informative priors	The bread and butter of applied Bayesian regression.
Hierarchical / multilevel models	Group-level likelihoods	Group-level parameters drawn from a population prior	Often called the killer app of Bayesian methods; partial pooling shrinks group estimates toward the population mean.
Gaussian processes	GP prior over functions	Kernel-controlled prior	Nonparametric regression and classification with calibrated uncertainty.
Mixture models	Weighted sum of component distributions	Dirichlet weights, component-wise priors	Used for density estimation and clustering; Dirichlet process mixtures extend to infinite components.
Latent Dirichlet Allocation (LDA)	Multinomial over words	Dirichlet over topics	A pure Bayesian topic model introduced by Blei, Ng, and Jordan in 2003.
Hidden Markov models with Bayesian priors	State transition and emission distributions	Dirichlet on transitions, conjugate priors on emissions	Used in speech recognition, biology, finance.
Bayesian neural networks	Neural network likelihood	Distribution over weights	Replace point-estimate weights with distributions to get predictive uncertainty.
Bayesian networks	Factorised joint over a directed graph	Conditional probability tables, often with Dirichlet priors	Graphical model framework that includes Naive Bayes as a special case.
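The Normal-Normal row is a useful example of how conjugacy works in practice, since the update reduces to adding precisions. A minimal sketch follows, assuming a known noise variance; the prior settings and the data are made up for illustration, and `normal_normal_update` is a hypothetical helper.

```python
def normal_normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior for a Normal mean with known noise variance and a Normal prior."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    # Precisions add; the posterior mean is the precision-weighted average.
    post_precision = prior_precision + data_precision
    post_mean = (
        prior_precision * prior_mean + data_precision * sample_mean
    ) / post_precision
    return post_mean, 1.0 / post_precision


data = [4.8, 5.4, 5.1, 4.9]  # assumed observations
post_mean, post_var = normal_normal_update(
    prior_mean=0.0, prior_var=1.0, data=data, noise_var=1.0
)
print(post_mean, post_var)
```

With four observations of unit variance against a unit-variance prior, the data carry four fifths of the total precision, so the posterior mean lies four fifths of the way from the prior mean to the sample mean. The same shrinkage mechanism, applied to group-level parameters, is what produces partial pooling in hierarchical models.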

Computational methods

The computational story is the reason Bayesian methods went from a niche philosophical position in the 1970s to a default tool in the 2020s. There is no single way to compute a posterior; choice of method depends on the model size, structure, and how much accuracy you need.

Method	What it does	Strengths	Limitations
Conjugate priors	Closed-form posterior using algebraic conjugacy	Exact; instantaneous	Restricted to a small set of model families
Laplace approximation	Gaussian fit at the posterior mode	Fast; easy to implement	Poor for skewed or multimodal posteriors
Variational inference (VI)	Approximate posterior with a tractable family by maximising the evidence lower bound (ELBO)	Scales to large data; deterministic	Tends to underestimate variance; quality depends on the variational family
Mean-field VI	VI with independent factors per parameter	Especially fast	Ignores posterior correlations
Stochastic / black-box VI	VI using Monte Carlo gradients of the ELBO	Works with arbitrary models	Variance of gradient estimators can be high
Metropolis-Hastings (1953/1970)	Generic MCMC that proposes a move and accepts or rejects it; random-walk proposals are the most common choice	Very general	Can mix slowly in high dimensions
Gibbs sampling (Geman & Geman 1984)	MCMC sampling from full conditionals	Easy when conditionals are tractable	Mixes slowly under strong correlations
Hamiltonian Monte Carlo / NUTS (Hoffman & Gelman 2014)	MCMC using gradient information from the log posterior	State of the art for continuous parameters; powers Stan	Requires differentiable models; expensive per step
Sequential Monte Carlo / particle filters	Recursively reweight and resample particles to track an evolving posterior	Natural for state-space models and online updates	Particle degeneracy in high dimensions
Importance sampling	Reweight samples from a proposal to estimate posterior expectations	Embarrassingly parallel	Fails badly when proposal is far from the posterior
Annealed importance sampling	Importance sampling along a temperature schedule	Useful for marginal likelihoods	Schedule design is delicate
INLA (Integrated Nested Laplace Approximation)	Combine Laplace approximations with numerical integration for latent Gaussian models	Very fast for the right model class	Restricted to latent Gaussian structure
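To make the accept/reject idea concrete, here is a minimal random-walk Metropolis sampler written with the Python standard library. It is a teaching sketch, not a production sampler: the target is an assumed Binomial likelihood (7 successes, 3 failures) with a flat prior, and the step size, chain length, and burn-in are arbitrary choices.

```python
import math
import random


def log_unnormalised_posterior(theta, successes=7, failures=3):
    """Log of likelihood times prior: Binomial data, flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside the unit interval
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(log_target, start, steps, step_size, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_logp = log_target(current)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)  # a rejected proposal repeats the current state
    return samples


samples = metropolis(log_unnormalised_posterior, start=0.5, steps=20_000, step_size=0.2)
kept = samples[2_000:]  # discard an initial burn-in segment
posterior_mean_estimate = sum(kept) / len(kept)
print(round(posterior_mean_estimate, 3))
```

Only the unnormalised posterior is needed, because the normalising constant cancels in the acceptance ratio. In this toy case the exact posterior is Beta(8, 4) with mean 2/3, which gives a check on the sampler that is not available in realistic models.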

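MCMC convergence diagnostics are a routine part of Bayesian workflow. Effective sample size, the Gelman-Rubin $\hat R$ statistic, and trace plots are the standard tools for checking that chains have mixed, and good Bayesian software reports them by default. Posterior predictive checks complement them by testing whether the fitted model reproduces the data, which is a question about model adequacy rather than sampler convergence.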
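The idea behind $\hat R$ is to compare the variance between chains with the variance within them. The sketch below implements the classic Gelman-Rubin form of the statistic; current software typically reports refined variants (split chains, rank normalisation), so this is a simplified illustration. The synthetic "chains" are independent Gaussian draws, used here only as stand-ins for sampler output.

```python
import math
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain sample variance.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(1)
# Four chains exploring the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains where two are stuck around a different location.
stuck = [
    [rng.gauss(offset, 1.0) for _ in range(1000)]
    for offset in (0.0, 0.0, 3.0, 3.0)
]

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 are consistent with chains that agree with one another; values well above 1, as in the second case, indicate that the chains have not converged to a common distribution.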

Probabilistic programming

Probabilistic programming languages let users describe a generative model in code and leave the inference engine to handle posterior computation.

Software	Language / backend	Notes
Stan (Carpenter et al. 2017)	C++ engine with R, Python, Julia, command-line, and other interfaces	NUTS is the default sampler; widely used in academia, industry, and pharma.
PyMC (Salvatier, Wiecki & Fonnesbeck 2016)	Python; originally Theano, now PyTensor and JAX	Pythonic API; supports NUTS, variational inference, and JAX-backed samplers.
Pyro (Bingham et al. 2018)	Python on PyTorch	Built at Uber for deep generative models and stochastic variational inference.
NumPyro (Phan, Pradhan & Jankowiak 2019)	Python on JAX	Lightweight rewrite of Pyro with very fast NUTS and SVI on accelerators.
Edward / TensorFlow Probability	Python on TensorFlow	TFP is the actively maintained successor to Edward, used inside Google.
BUGS / WinBUGS / OpenBUGS	Custom DSL	The original Bayesian DSL from the 1990s; mostly historical now.
JAGS	C++ implementation of the BUGS language	Still common in ecology and social science.
Turing.jl	Julia	Composable inference algorithms in Julia.

Most of these tools share a common pattern: write the model once in a high-level language and run any of several inference algorithms without rewriting it. The newer, gradient-based systems also obtain derivatives of the log posterior automatically through automatic differentiation. That separation between model and inference is what makes probabilistic programming practical.

Bayesian connections to machine learning

Many machine learning techniques have a Bayesian reading, and several were originally derived from a Bayesian starting point.

Topic	Bayesian connection
L2 regularisation	Equivalent to maximum a posteriori estimation under a Gaussian prior on the weights.
L1 regularisation (Lasso)	Equivalent to MAP estimation under a Laplace prior, which encourages sparsity.
Dropout	Yarin Gal's 2016 thesis showed that dropout in deep networks can be interpreted as approximate Bayesian inference; Monte Carlo dropout uses this to estimate predictive uncertainty.
SWAG	Maddox et al. (2019) fit a Gaussian over the trajectory of stochastic gradient descent iterates, giving a cheap approximate posterior over neural network weights.
Bayesian model selection	Bayes factors compare models by their marginal likelihood; the BIC is an asymptotic approximation.
Bayesian optimisation	Uses a Gaussian process surrogate for an expensive black-box function, and a Bayesian acquisition function to choose the next evaluation point. Heavily used for hyperparameter tuning.
Topic models	LDA and its extensions are pure Bayesian models with Dirichlet priors.
Bayesian reinforcement learning	Maintain a posterior over MDP dynamics or value functions; Thompson sampling is a Bayesian bandit strategy.
Calibration	Posterior predictive distributions from a well-specified Bayesian model tend to be well calibrated; non-Bayesian deep nets typically need post-hoc calibration.
Uncertainty quantification	Bayesian models give a principled separation of aleatoric and epistemic uncertainty.
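The regularisation rows follow from a short derivation. As an illustrative setting, assume a linear model with weights $w$, Gaussian noise of variance $\sigma^2$, and an independent zero-mean Gaussian prior of variance $\tau^2$ on each weight; the symbols $w$, $x_i$, $y_i$, $\sigma$, $\tau$, and $\lambda$ are introduced here for the example. Taking logs of $\text{Likelihood} \times \text{Prior}$ and dropping terms that do not depend on $w$:

$$\hat{w}_{\text{MAP}} = \arg\max_w \big[\log P(D \mid w) + \log P(w)\big] = \arg\min_w \left[\frac{1}{2\sigma^2}\sum_i \big(y_i - w^\top x_i\big)^2 + \frac{1}{2\tau^2}\lVert w \rVert_2^2\right]$$

Multiplying through by $2\sigma^2$ gives the ridge objective:

$$\hat{w}_{\text{MAP}} = \arg\min_w \left[\sum_i \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert_2^2\right], \qquad \lambda = \frac{\sigma^2}{\tau^2}$$

A tighter prior (smaller $\tau^2$) therefore corresponds to stronger regularisation. Replacing the Gaussian prior with a Laplace prior of scale $b$ turns the penalty into $\frac{1}{b}\lVert w \rVert_1$, which after the same rescaling gives the Lasso objective with $\lambda = 2\sigma^2 / b$.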

Real-world applications

Bayesian statistics is no longer a niche tool. It is mainstream across science, industry, and policy.

Clinical trials and drug development. The U.S. Food and Drug Administration has issued guidance for Bayesian designs for medical devices, and adaptive Bayesian designs are used in oncology and rare-disease trials where data are scarce and prior information is valuable.
A/B testing in technology companies. Bayesian-flavoured tests give a continuously updated probability that one variant is better than another, avoiding many of the multiple-testing pitfalls of repeated frequentist peeking.
Genetics. BayesC, BayesR, and BSLMM are widely used Bayesian methods for genomic prediction and genome-wide association studies, where many small effects need to be regularised.
Search and ranking. Bayesian linear regression and Thompson sampling are standard tools for click-through rate estimation and online ranking.
Cancer staging and medical diagnosis, where posterior probabilities of disease given test results are the natural output of a diagnostic model.
Financial risk modelling, including Value at Risk under parameter uncertainty and Bayesian forecasting of macroeconomic time series.
Astrophysics. The LIGO and Virgo collaborations use Bayesian inference for gravitational wave parameter estimation; exoplanet detection routinely uses nested sampling and Bayesian model comparison.
Marketing mix modelling, where hierarchical Bayesian regressions estimate the contribution of each channel under realistic priors and constraints.
Election forecasting. Andrew Gelman's models for The Economist and Nate Silver's models at FiveThirtyEight blend polling, fundamentals, and prior information using hierarchical Bayesian regression.
Reliability engineering, where Weibull models with Bayesian priors are used for failure-time analysis under sparse data.
Forensic science, where likelihood ratios and Bayesian reasoning are the formal language of evidence interpretation in DNA analysis.
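The A/B testing case above can be sketched in a few lines. The example below assumes independent flat Beta(1, 1) priors on the two conversion rates and made-up counts, and estimates the posterior probability that variant B has the higher rate by Monte Carlo; `prob_b_beats_a` is a hypothetical helper, not part of any testing platform.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Each posterior is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Assumed data: 120/1000 conversions for A, 150/1000 for B.
p_b_better = prob_b_beats_a(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
# Identical data for both variants should give a probability near one half.
p_tie = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=100, n_b=1000)
print(p_b_better, p_tie)
```

The output is a direct probability statement about the comparison of interest, which can be recomputed as data arrive. How that probability is turned into a stopping rule is a separate design decision that this sketch does not address.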

Strengths

Bayesian methods have a number of properties that make them attractive in science and industry alike.

Uncertainty is handled in a principled way. The posterior is itself a probability distribution, and predictions automatically account for parameter uncertainty.
Updating with new data is coherent. The posterior after one batch of data is the prior for the next, which makes streaming and sequential analyses natural.
Hierarchical models give partial pooling that interpolates smoothly between treating groups identically and treating them independently. This is one of the most consequential ideas in modern applied statistics.
Decision theory plugs in directly: combine the posterior with a loss function and minimise expected loss to get an optimal decision.
Posterior predictive checks let you criticise a model by simulating data from the fitted model and comparing it to the observed data.
The framework does not rely on asymptotic approximations or large-sample assumptions. Inference for small samples is no different in principle from inference for large samples.
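The decision-theoretic point can be illustrated with a small discrete example. The sketch below assumes a posterior over tomorrow's demand and an asymmetric loss for a stocking decision; both are hypothetical, as are the function names.

```python
def expected_loss(action, posterior, loss):
    """Posterior expected loss of taking `action`."""
    return sum(prob * loss(action, state) for state, prob in posterior.items())


def bayes_action(actions, posterior, loss):
    """Action that minimises posterior expected loss."""
    return min(actions, key=lambda action: expected_loss(action, posterior, loss))


# Assumed posterior over demand (units) after seeing past sales.
posterior = {8: 0.2, 10: 0.5, 12: 0.3}


def stocking_loss(stock, demand):
    # Assumed costs: 1 per unsold unit, 3 per unit of unmet demand.
    return 1.0 * max(stock - demand, 0) + 3.0 * max(demand - stock, 0)


best = bayes_action(range(8, 13), posterior, stocking_loss)
print(best, expected_loss(best, posterior, stocking_loss))
```

With these assumed costs the optimal stock level is 12, not the most probable demand of 10, because running short is three times as costly as overstocking. The decision depends on the whole posterior and the loss function together, not on a single point estimate.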

Weaknesses

The approach also has real costs.

Computation is expensive. Even with modern MCMC and variational inference, fitting a complex Bayesian model can take orders of magnitude longer than a comparable frequentist analysis.
Prior selection can be controversial. Two analysts with different priors can reach different conclusions from the same data, and writing down a prior for an unfamiliar parameter is genuinely hard.
Communication is harder. Many practitioners and most regulators are trained in frequentist language, and credible intervals, posterior probabilities, and Bayes factors take more explanation than p-values.
MCMC convergence diagnostics matter. A poorly mixed chain can give misleading answers, and detecting that requires care.
The framework assumes parameters have a meaningful probability distribution, which some philosophers and statisticians regard as a strong metaphysical commitment.

Modern resurgence

Bayesian methods have moved from the periphery to the centre of several fields. In cognitive science the Bayesian brain hypothesis treats perception and learning as approximate Bayesian inference. In industry, hierarchical Bayesian models are standard for marketing analytics, recommendation systems, and reliability engineering. In machine learning, Bayesian deep learning is an active research area motivated by uncertainty quantification, model calibration, and AI safety. In medical statistics the FDA has accepted Bayesian designs for confirmatory trials in selected settings, and adaptive Bayesian designs are routine in oncology platform trials. In the debate around the reproducibility crisis, prominent statisticians have argued that Bayesian model comparison and posterior reporting should partly replace null hypothesis significance testing.

None of this means the frequentist tradition has been displaced. Most introductory statistics courses still teach the Neyman-Pearson framework, and most published p-values come from frequentist procedures. What has changed is that Bayesian methods are now a serious alternative rather than a fringe view, with mature software and large user communities.
