# Bayesian statistics

> Source: https://aiwiki.ai/wiki/bayesian_statistics
> Updated: 2026-05-01
> Categories: Mathematics, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Bayesian statistics** is a branch of statistics in which probability is interpreted as a degree of belief, and statistical inference is performed by updating those beliefs in light of new evidence using [Bayes' theorem](/wiki/bayes_theorem). It contrasts with frequentist statistics, where probability refers to the long-run frequency of an event in repeated experiments and parameters are treated as fixed but unknown constants. In the Bayesian framework, parameters are themselves random variables with probability distributions, and the central object of inference is the posterior distribution over those parameters given the observed data.

The approach is named for the Reverend Thomas Bayes, whose 1763 essay laid the mathematical foundations, but it took roughly two centuries, several philosophical revivals, and a computational revolution before Bayesian methods became practical at the scale modern science demands. Today the framework is used in clinical trial design, machine learning, cosmology, election forecasting, and large parts of cognitive science, and it underlies a thriving ecosystem of probabilistic programming languages such as Stan, PyMC, and NumPyro.

## The fundamental equation

All of Bayesian statistics rests on a single rearrangement of the definition of conditional probability. Given a hypothesis or parameter $H$ and observed data $D$:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

A more compact and frequently quoted form is:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

The four ingredients each have a precise role:

| Term | Symbol | Meaning |
|------|--------|---------|
| Posterior | $P(H \mid D)$ | Updated probability of the hypothesis after seeing the data |
| Likelihood | $P(D \mid H)$ | Probability of the observed data assuming the hypothesis is true |
| [Prior belief](/wiki/prior_belief) | $P(H)$ | Probability assigned to the hypothesis before seeing the data |
| Marginal likelihood / evidence | $P(D) = \int P(D \mid H)\,P(H)\,dH$ | Normalising constant that makes the posterior integrate to one; also used for model comparison |

In most realistic problems the integral defining $P(D)$ has no closed form, so a great deal of Bayesian computation is devoted either to evaluating that integral, sidestepping it, or sampling from the unnormalised posterior $P(D \mid H)\,P(H)$ directly.
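
When the hypothesis space is discrete, the integral for $P(D)$ becomes a finite sum and the whole update can be written out directly. The sketch below is a minimal illustration, not a general-purpose routine; the diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) and the function name `posterior` are assumptions chosen for the example.

```python
def posterior(prior, likelihood):
    """Return (P(H | D) for every hypothesis H, P(D)) from P(H) and P(D | H)."""
    # Unnormalised posterior: likelihood times prior, hypothesis by hypothesis.
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    # Marginal likelihood P(D): a sum here because the hypotheses are discrete.
    evidence = sum(unnormalised.values())
    return {h: u / evidence for h, u in unnormalised.items()}, evidence


# Hypothetical diagnostic test; all numbers are illustrative assumptions.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_positive = {"disease": 0.95, "healthy": 0.05}  # P(positive | H)

post, evidence = posterior(prior, likelihood_positive)
print(post["disease"], evidence)
```

With these inputs the posterior probability of disease after a positive test is roughly 16%, far below the 95% sensitivity, because the low prior dominates the update.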

## Core concepts

Bayesian statistics has its own vocabulary, much of it shared with [Bayesian inference](/wiki/bayesian_inference) and probabilistic modelling more broadly.

| Concept | Description |
|---------|-------------|
| Prior | Distribution over parameters that encodes belief before seeing the data. May be informative (encoding domain knowledge) or uninformative or weakly informative (intended to let the data dominate). |
| Likelihood | Probability model linking parameters to data; the same object used in maximum likelihood estimation, but interpreted as a function of the parameters given fixed data. |
| Posterior | Distribution over parameters after conditioning on the observed data. Summaries include the posterior mean, median, mode (the maximum a posteriori or MAP estimate), and quantiles. |
| Marginal likelihood | $P(D)$, also called the model evidence. Used in Bayes factors and Bayesian model comparison. |
| Posterior predictive distribution | $P(D_{\text{new}} \mid D) = \int P(D_{\text{new}} \mid \theta)\,P(\theta \mid D)\,d\theta$, the predictive distribution for new data after marginalising over parameter uncertainty. |
| Hyperparameter | A parameter of the prior distribution itself; treating hyperparameters as random variables with their own priors gives rise to hierarchical models. |
| Conjugate prior | Prior chosen so that the posterior belongs to the same parametric family. Conjugacy yields closed-form posteriors and was historically the main way Bayesian problems were made tractable. |
| Credible interval | Interval that contains a specified amount of posterior probability (for example 95%). The Bayesian counterpart to a frequentist confidence interval, but with a direct probabilistic interpretation: there is a 95% probability the parameter lies in the interval given the data and prior. |
| Highest posterior density (HPD) region | The smallest region containing a given posterior probability mass; for a multimodal posterior it can consist of several disjoint intervals. |
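
Several of these summaries can be computed by brute force when there is a single parameter. The following sketch approximates the posterior for a binomial proportion on a grid and reads off the posterior mean, the MAP estimate, and an equal-tailed credible interval. The flat prior, the data (7 successes in 10 trials), and the helper names are assumptions made for illustration.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Posterior over a binomial proportion on a regular grid, flat prior assumed."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Likelihood times a constant prior, evaluated point by point.
    weights = [t ** successes * (1 - t) ** (trials - successes) for t in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def summarise(grid, probs, mass=0.95):
    """Posterior mean, MAP estimate, and equal-tailed credible interval."""
    mean = sum(t * p for t, p in zip(grid, probs))
    map_estimate = grid[max(range(len(grid)), key=probs.__getitem__)]
    tail = (1 - mass) / 2
    cumulative, lower, upper = 0.0, None, None
    for t, p in zip(grid, probs):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = t
        if upper is None and cumulative >= 1 - tail:
            upper = t
    return mean, map_estimate, (lower, upper)


grid, probs = grid_posterior(successes=7, trials=10)
mean, map_estimate, interval = summarise(grid, probs)
# For this model the posterior predictive probability that the next trial
# succeeds equals the posterior mean of the proportion.
predictive_success = mean
print(mean, map_estimate, interval)
```

Grid approximation only scales to a handful of parameters, which is why the sampling and approximation methods discussed later in the article exist.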

## History

The history of Bayesian statistics is unusually contested for a mathematical subject; the framework spent much of the twentieth century on the margins of academic statistics before re-emerging as a mainstream approach.

The Reverend Thomas Bayes, an English Presbyterian minister, formulated his theorem some time in the late 1740s. He never published it. After his death in 1761 his friend Richard Price edited the manuscript and read it before the Royal Society in December 1763 as "An Essay towards solving a Problem in the Doctrine of Chances". The essay tackled the inverse probability problem of inferring an unknown chance from observed successes, and it included Price's own introduction explaining the philosophical motivation.

Pierre-Simon Laplace developed the same ideas independently and far more systematically. His 1812 "Théorie analytique des probabilités" made what we now call Bayesian inference a workhorse of nineteenth-century science, applied to problems in astronomy, demography, and the reliability of legal testimony. For Laplace, probability was the natural language of partial knowledge, and in the absence of other information prior probabilities were assigned equally under the principle of insufficient reason.

The twentieth century saw a sharp turn against this view. Ronald Fisher, Jerzy Neyman, and Egon Pearson built the frequentist edifice between roughly 1920 and 1940: maximum likelihood, significance tests, p-values, confidence intervals, and the Neyman-Pearson framework for hypothesis testing. Their work pushed Bayesian methods to the margins of academic statistics, in part because the priors looked subjective and arbitrary, and in part because the resulting integrals were generally impossible to compute.

A quiet revival began in the 1930s and 1940s. Bruno de Finetti's 1937 paper "La prévision: ses lois logiques, ses sources subjectives" gave subjective probability a rigorous behavioural foundation, deriving the rules of probability from coherent betting odds and proving the famous representation theorem for exchangeable sequences. Leonard Savage's 1954 "The Foundations of Statistics" combined de Finetti's subjective probability with the von Neumann-Morgenstern theory of utility to produce the canonical axiomatisation of Bayesian decision theory. Edwin Jaynes argued through the 1950s and 1960s that probability theory is simply extended logic, a view eventually compiled in his posthumous 2003 book "Probability Theory: The Logic of Science".

The practical breakthrough was computational. The Metropolis algorithm appeared in a 1953 physics paper by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, and was generalised to non-symmetric proposals by W. K. Hastings in 1970. Stuart and Donald Geman introduced the [Gibbs sampler](/wiki/markov_chain_monte_carlo) in 1984 for image restoration. Once Alan Gelfand and Adrian Smith showed in 1990 how Gibbs sampling could be applied to mainstream Bayesian problems, [Markov Chain Monte Carlo (MCMC)](/wiki/mcmc) became the engine that let Bayesian inference handle realistic models. The 2014 paper by Matthew Hoffman and Andrew Gelman introduced the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that powers modern Stan and freed users from manually tuning the number of leapfrog steps.

The last fifteen years brought probabilistic programming into the mainstream. Stan, PyMC, Pyro, NumPyro, and TensorFlow Probability turned model specification into something close to scripting, and they made Bayesian inference accessible outside the small community of statisticians willing to write their own samplers.

## Frequentist versus Bayesian

The two paradigms answer different questions and use the same data in different ways. The differences are real but often less stark in practice than the philosophical debates suggest.

| Issue | Frequentist | Bayesian |
|-------|-------------|----------|
| Probability interpretation | Long-run frequency in repeated trials | Degree of belief; a measure of uncertainty |
| Parameters | Fixed unknown constants | Random variables with distributions |
| Inference target | Estimators with good sampling properties | Posterior distributions over parameters |
| Interval estimate | Confidence interval (about the procedure) | Credible interval (about the parameter, given data) |
| Hypothesis testing | Null hypothesis significance testing, p-values | Posterior probabilities, Bayes factors, model comparison |
| Prior information | Not formally used in pure frequentist analysis | Encoded explicitly as a prior distribution |
| Computational cost | Usually low; closed-form or optimisation | Usually high; requires sampling or approximation of the posterior |
| Model checking | Goodness-of-fit tests, residual analysis | Posterior predictive checks, cross-validation |
| Sequential updating | Awkward; multiple testing corrections needed | Natural; today's posterior is tomorrow's prior |

A classic example: a frequentist 95% confidence interval is constructed so that, across many hypothetical repetitions of the experiment, 95% of intervals computed this way will contain the true parameter. A 95% Bayesian credible interval is one in which, given this particular data set and prior, the posterior probability that the parameter lies in the interval is 95%. The Bayesian statement is the one most people think a confidence interval means.
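
The repeated-sampling meaning of a confidence interval can be checked by simulation. The sketch below draws many data sets from a normal distribution with known standard deviation, builds the usual 95% z-interval for each, and counts how often the interval contains the true mean. The true mean, sample size, and number of repetitions are arbitrary assumptions.

```python
import random


def coverage(true_mean=2.0, sigma=1.0, n=20, repetitions=4000, seed=0):
    """Fraction of 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # known-variance z-interval
    hits = 0
    for _ in range(repetitions):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval sample_mean +/- half_width covers the truth when
        # the sample mean lies within half_width of it.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / repetitions


observed_coverage = coverage()
print(observed_coverage)
```

The 95% figure describes the procedure across repetitions, not any single interval. In this particular model a flat prior on the mean gives a 95% credible interval with the same endpoints, which is one reason the two interpretations are so often conflated.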

## Common Bayesian models

Most applied Bayesian work uses a small set of recurring building blocks. Many of the simplest models exploit conjugacy, where prior and likelihood are mathematically matched so the posterior has a known form.

| Model | Likelihood | Prior | Notes |
|-------|-----------|-------|-------|
| Beta-Binomial | Binomial | Beta | Canonical coin-flip example; posterior is Beta with updated parameters. |
| Normal-Normal | Normal with known variance | Normal | Posterior mean is a precision-weighted average of prior mean and sample mean. |
| Normal-Inverse-Gamma | Normal with unknown variance | Normal-Inverse-Gamma | Standard joint conjugate for mean and variance of a Gaussian. |
| Dirichlet-Multinomial | Multinomial | Dirichlet | Used for categorical data, language models, topic models. |
| Poisson-Gamma | Poisson | Gamma | Counts data; rate posterior is Gamma. |
| Bayesian linear regression | Gaussian likelihood on residuals | Gaussian or hierarchical priors on coefficients | Gives full posterior over coefficients and predictive intervals that account for parameter uncertainty. |
| Bayesian generalised linear models | GLM (logistic, Poisson, etc.) | Weakly informative priors | The bread and butter of applied Bayesian regression. |
| Hierarchical / multilevel models | Group-level likelihoods | Group-level parameters drawn from a population prior | Often called the killer app of Bayesian methods; partial pooling shrinks group estimates toward the population mean. |
| Gaussian processes | Gaussian (for regression) or other observation likelihood | GP prior over functions, with a kernel controlling smoothness and scale | Nonparametric regression and classification with calibrated uncertainty. |
| Mixture models | Weighted sum of component distributions | Dirichlet weights, component-wise priors | Used for density estimation and clustering; Dirichlet process mixtures extend to infinite components. |
| Latent Dirichlet Allocation (LDA) | Multinomial over words | Dirichlet over topics | A pure Bayesian topic model introduced by Blei, Ng, and Jordan in 2003. |
| Hidden Markov models with Bayesian priors | State transition and emission distributions | Dirichlet on transitions, conjugate priors on emissions | Used in speech recognition, biology, finance. |
| [Bayesian neural networks](/wiki/bayesian_neural_network) | Neural network likelihood | Distribution over weights | Replace point-estimate weights with distributions to get predictive uncertainty. |
| [Bayesian networks](/wiki/bayesian_network) | Factorised joint over a directed graph | Conditional probability tables, often with Dirichlet priors | Graphical model framework that includes [Naive Bayes](/wiki/naive_bayes) as a special case. |
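
For the conjugate pairs in the first rows of the table, the posterior update reduces to arithmetic on the prior's parameters. The functions below are a small sketch of three such updates; the function names and the rate parameterisation of the Gamma distribution are choices made here for illustration.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


def gamma_poisson_update(shape, rate, counts):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior parameters."""
    return shape + sum(counts), rate + len(counts)


def normal_normal_update(prior_mean, prior_var, data, noise_var):
    """Normal prior on the mean, known noise variance -> posterior mean and variance."""
    n = len(data)
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    sample_mean = sum(data) / n
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var


print(beta_binomial_update(1, 1, successes=7, failures=3))
print(gamma_poisson_update(2.0, 1.0, counts=[3, 5, 4]))
print(normal_normal_update(0.0, 1.0, data=[2.0, 2.0, 2.0, 2.0], noise_var=4.0))
```

In the last call the prior and the data carry equal precision, so the posterior mean lands halfway between the prior mean 0 and the sample mean 2.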

## Computational methods

The computational story is the reason Bayesian methods went from a niche philosophical position in the 1970s to a default tool in the 2020s. There is no single way to compute a posterior; choice of method depends on the model size, structure, and how much accuracy you need.

| Method | What it does | Strengths | Limitations |
|--------|--------------|-----------|-------------|
| Conjugate priors | Closed-form posterior using algebraic conjugacy | Exact; instantaneous | Restricted to a small set of model families |
| Laplace approximation | Gaussian fit at the posterior mode | Fast; easy to implement | Poor for skewed or multimodal posteriors |
| Variational inference (VI) | Approximate posterior with a tractable family by maximising the evidence lower bound (ELBO) | Scales to large data; deterministic | Tends to underestimate variance; quality depends on the variational family |
| Mean-field VI | VI with independent factors per parameter | Especially fast | Ignores posterior correlations |
| Stochastic / black-box VI | VI using Monte Carlo gradients of the ELBO | Works with arbitrary models | Variance of gradient estimators can be high |
| Metropolis-Hastings (1953/1970) | Generic MCMC that proposes a move and accepts or rejects it; random-walk proposals are the simplest case | Very general | Can mix slowly in high dimensions |
| Gibbs sampling (Geman & Geman 1984) | MCMC sampling from full conditionals | Easy when conditionals are tractable | Mixes slowly under strong correlations |
| Hamiltonian Monte Carlo / NUTS (Hoffman & Gelman 2014) | MCMC using gradient information from the log posterior | State of the art for continuous parameters; powers Stan | Requires differentiable models; expensive per step |
| Sequential Monte Carlo / particle filters | Recursively reweight and resample particles to track an evolving posterior | Natural for state-space models and online updates | Particle degeneracy in high dimensions |
| [Importance sampling](/wiki/importance_sampling) | Reweight samples from a proposal to estimate posterior expectations | Embarrassingly parallel | Fails badly when proposal is far from the posterior |
| Annealed importance sampling | Importance sampling along a temperature schedule | Useful for marginal likelihoods | Schedule design is delicate |
| INLA (Integrated Nested Laplace Approximation) | Combine Laplace approximations with numerical integration for latent Gaussian models | Very fast for the right model class | Restricted to latent Gaussian structure |
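
To make the accept/reject idea concrete, here is a minimal random-walk Metropolis sampler for a one-dimensional target. It is a sketch under simplifying assumptions: the proposal is a symmetric Gaussian, the target is a toy unnormalised log posterior centred at 3, and the step size and burn-in length are hand-picked rather than tuned.

```python
import math
import random


def metropolis(log_target, start, steps, step_size=2.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance probability is
        # min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected move repeats the current state
    return samples


def log_unnormalised_posterior(theta):
    """Toy target: a normal density with mean 3 and variance 1, up to a constant."""
    return -0.5 * (theta - 3.0) ** 2


draws = metropolis(log_unnormalised_posterior, start=0.0, steps=20000)[2000:]
posterior_mean = sum(draws) / len(draws)
posterior_var = sum((d - posterior_mean) ** 2 for d in draws) / len(draws)
print(posterior_mean, posterior_var)
```

Only differences of log densities enter the acceptance step, so the normalising constant $P(D)$ is never needed.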

MCMC convergence diagnostics are a routine part of Bayesian workflow. Effective sample size, the Gelman-Rubin $\hat R$ statistic, and trace plots are the standard checks that the chains have mixed, while posterior predictive checks assess whether the fitted model reproduces the data. Good Bayesian software reports the convergence diagnostics by default.
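
As an illustration of what $\hat R$ measures, the sketch below implements the classic Gelman-Rubin formula, which compares between-chain and within-chain variance. Current software typically reports refined variants of this statistic, so treat it as a teaching version. The "chains" are independent Gaussian draws standing in for sampler output, which is an assumption made to keep the example self-contained.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four stand-in chains that all explore the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four stand-in chains stuck in two different regions.
stuck = [[rng.gauss(centre, 1.0) for _ in range(1000)]
         for centre in (0.0, 0.0, 3.0, 3.0)]

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
print(r_hat_mixed, r_hat_stuck)
```

Values close to 1 are consistent with the chains agreeing; values well above 1, as in the second case, indicate that the chains have not explored the same distribution.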

## Probabilistic programming

Probabilistic programming languages let users describe a generative model in code and leave the inference engine to handle posterior computation.

| Software | Language / backend | Notes |
|----------|--------------------|-------|
| Stan (Carpenter et al. 2017) | C++ engine with R, Python, Julia, command-line, and other interfaces | NUTS is the default sampler; widely used in academia, industry, and pharma. |
| PyMC (Salvatier, Wiecki & Fonnesbeck 2016) | Python; originally Theano, now PyTensor and JAX | Pythonic API; supports NUTS, variational inference, and JAX-backed samplers. |
| Pyro (Bingham et al. 2018) | Python on PyTorch | Built at Uber for deep generative models and stochastic variational inference. |
| NumPyro (Phan, Pradhan & Jankowiak 2019) | Python on JAX | Lightweight rewrite of Pyro with very fast NUTS and SVI on accelerators. |
| Edward / TensorFlow Probability | Python on TensorFlow | TFP is the actively maintained successor to Edward, used inside Google. |
| BUGS / WinBUGS / OpenBUGS | Custom DSL | The original Bayesian DSL from the 1990s; mostly historical now. |
| JAGS | C++ implementation of the BUGS language | Still common in ecology and social science. |
| Turing.jl | Julia | Composable inference algorithms in Julia. |

These tools share a common pattern: write the model once in a high-level language and run any of several inference algorithms without rewriting it. The gradient-based systems, such as Stan, PyMC, Pyro, and NumPyro, also obtain gradients automatically through automatic differentiation. That separation between model and inference is the defining idea of probabilistic programming.
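
The separation of model from inference can be imitated in a few lines of plain Python. In the sketch below the model is written once as a log prior and a log likelihood, and two unrelated inference routines (grid quadrature and self-normalised importance sampling) consume it unchanged. This is a toy illustration of the pattern only: it involves no automatic differentiation and does not reproduce the API of any of the libraries above, and the data and the Beta(2, 2) prior are assumptions.

```python
import math
import random

# Model, written once: coin-flip data with a Beta(2, 2) prior on the bias.
DATA = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # hypothetical observations


def log_prior(theta):
    return math.log(theta) + math.log(1 - theta)  # Beta(2, 2) up to a constant


def log_likelihood(theta):
    heads = sum(DATA)
    return heads * math.log(theta) + (len(DATA) - heads) * math.log(1 - theta)


def log_joint(theta):
    return log_prior(theta) + log_likelihood(theta)


# Inference routine 1: deterministic midpoint quadrature on a grid.
def posterior_mean_grid(log_density, points=2000):
    grid = [(i + 0.5) / points for i in range(points)]
    weights = [math.exp(log_density(t)) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


# Inference routine 2: self-normalised importance sampling, uniform proposal.
def posterior_mean_importance(log_density, draws=20000, seed=0):
    rng = random.Random(seed)
    thetas = [rng.uniform(1e-9, 1 - 1e-9) for _ in range(draws)]
    weights = [math.exp(log_density(t)) for t in thetas]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)


grid_estimate = posterior_mean_grid(log_joint)
importance_estimate = posterior_mean_importance(log_joint)
print(grid_estimate, importance_estimate)
```

Both routines should agree with the exact conjugate answer, the mean of a Beta(9, 5) distribution, which is 9/14.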

## Bayesian connections to machine learning

Many machine learning techniques have a Bayesian reading, and several were originally derived from a Bayesian starting point.

| Topic | Bayesian connection |
|-------|---------------------|
| L2 regularisation | Equivalent to maximum a posteriori estimation under a Gaussian prior on the weights. |
| L1 regularisation (Lasso) | Equivalent to MAP estimation under a Laplace prior, which encourages sparsity. |
| Dropout | Yarin Gal's 2016 thesis showed that dropout in deep networks can be interpreted as approximate Bayesian inference; Monte Carlo dropout uses this to estimate predictive uncertainty. |
| SWAG | Maddox et al. (2019) fit a Gaussian over the trajectory of stochastic gradient descent iterates, giving a cheap approximate posterior over neural network weights. |
| Bayesian model selection | Bayes factors compare models by their marginal likelihood; the BIC is an asymptotic approximation. |
| [Bayesian optimisation](/wiki/bayesian_optimization) | Uses a Gaussian process surrogate for an expensive black-box function, and a Bayesian acquisition function to choose the next evaluation point. Heavily used for hyperparameter tuning. |
| Topic models | LDA and its extensions are pure Bayesian models with Dirichlet priors. |
| Bayesian reinforcement learning | Maintain a posterior over MDP dynamics or value functions; Thompson sampling is a Bayesian bandit strategy. |
| Calibration | Bayesian posterior predictive distributions are well calibrated when the model and prior are well specified; non-Bayesian deep nets typically need post-hoc calibration. |
| Uncertainty quantification | Bayesian models give a principled separation of aleatoric and epistemic uncertainty. |

## Real-world applications

Bayesian statistics is no longer a niche tool. It is mainstream across science, industry, and policy.

- Clinical trials and drug development. The U.S. Food and Drug Administration has issued guidance for Bayesian designs for medical devices, and adaptive Bayesian designs are used in oncology and rare-disease trials where data are scarce and prior information is valuable.
- A/B testing in technology companies. Bayesian-flavoured tests give a continuously updated probability that one variant is better than another, avoiding many of the multiple-testing pitfalls of repeated frequentist peeking.
- Genetics. BayesC, BayesR, and BSLMM are widely used Bayesian methods for genomic prediction and genome-wide association studies, where many small effects need to be regularised.
- Search and ranking. Bayesian linear regression and Thompson sampling are standard tools for click-through rate estimation and online ranking.
- Cancer staging and medical diagnosis, where posterior probabilities of disease given test results are the natural output of a diagnostic model.
- Financial risk modelling, including Value at Risk under parameter uncertainty and Bayesian forecasting of macroeconomic time series.
- Astrophysics. The LIGO and Virgo collaborations use Bayesian inference for gravitational wave parameter estimation; exoplanet detection routinely uses nested sampling and Bayesian model comparison.
- Marketing mix modelling, where hierarchical Bayesian regressions estimate the contribution of each channel under realistic priors and constraints.
- Election forecasting. Andrew Gelman's models for The Economist and Nate Silver's models at FiveThirtyEight blend polling, fundamentals, and prior information using hierarchical Bayesian regression.
- Reliability engineering, where Weibull models with Bayesian priors are used for failure-time analysis under sparse data.
- Forensic science, where likelihood ratios and Bayesian reasoning are the formal language of evidence interpretation in DNA analysis.
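
The A/B testing item above can be sketched with two Beta posteriors and Monte Carlo draws. The example assumes independent uniform Beta(1, 1) priors on each variant's conversion rate; the conversion counts and the function name are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Conjugate Beta posteriors for each variant's conversion rate.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical experiment: 120/1000 conversions for A, 150/1000 for B.
p_better = prob_b_beats_a(120, 1000, 150, 1000)
# Identical data for both variants should give a probability near one half.
p_tie = prob_b_beats_a(120, 1000, 120, 1000, seed=1)
print(p_better, p_tie)
```

Because the posterior is updated by simply adding new counts, the same function can be rerun as data arrive, which is the "continuously updated probability" that the item describes.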

## Strengths

Bayesian methods have a number of properties that make them attractive in science and industry alike.

- Uncertainty is handled in a principled way. The posterior is itself a probability distribution, and predictions automatically account for parameter uncertainty.
- Updating with new data is coherent. The posterior after one batch of data is the prior for the next, which makes streaming and sequential analyses natural.
- Hierarchical models give partial pooling that interpolates smoothly between treating groups identically and treating them independently. This is one of the most consequential ideas in modern applied statistics.
- Decision theory plugs in directly: combine the posterior with a loss function and minimise expected loss to get an optimal decision.
- Posterior predictive checks let you criticise a model by simulating data from the fitted model and comparing it to the observed data.
- The framework does not rely on asymptotic approximations or large-sample assumptions. Inference for small samples is no different in principle from inference for large samples.
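
The decision-theoretic item above amounts to a short computation once a posterior and a loss function are available. The sketch below picks the action with the lowest posterior expected loss for a discrete posterior; the probabilities, the actions, and the loss values are illustrative assumptions.

```python
def best_action(posterior, actions, loss):
    """Return the action minimising posterior expected loss, plus all expected losses."""
    expected = {
        action: sum(p * loss(action, state) for state, p in posterior.items())
        for action in actions
    }
    return min(expected, key=expected.get), expected


# Hypothetical posterior over a patient's state after a positive test.
posterior = {"disease": 0.16, "healthy": 0.84}


def loss(action, state):
    """Illustrative losses: missing a disease is ten times worse than overtreating."""
    table = {
        ("treat", "disease"): 0.0,
        ("treat", "healthy"): 1.0,
        ("wait", "disease"): 10.0,
        ("wait", "healthy"): 0.0,
    }
    return table[(action, state)]


choice, expected_losses = best_action(posterior, ["treat", "wait"], loss)
print(choice, expected_losses)
```

Although disease is the less probable state, treating has the lower expected loss here because the cost of a missed case is so much higher; changing the loss table changes the decision without touching the posterior.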

## Weaknesses

The approach also has real costs.

- Computation is expensive. Even with modern MCMC and variational inference, fitting a complex Bayesian model can take orders of magnitude longer than a comparable frequentist analysis.
- Prior selection can be controversial. Two analysts with different priors can reach different conclusions from the same data, and writing down a prior for an unfamiliar parameter is genuinely hard.
- Communication is harder. Many practitioners and most regulators are trained in frequentist language, and credible intervals, posterior probabilities, and Bayes factors take more explanation than p-values.
- MCMC convergence diagnostics matter. A poorly mixed chain can give misleading answers, and detecting that requires care.
- The framework assumes parameters have a meaningful probability distribution, which some philosophers and statisticians regard as a strong metaphysical commitment.

## Modern resurgence

Bayesian methods have moved from the periphery to the centre of several fields. In cognitive science the Bayesian brain hypothesis treats perception and learning as approximate Bayesian inference. In industry, hierarchical Bayesian models are standard for marketing analytics, recommendation systems, and reliability engineering. In machine learning, Bayesian deep learning is an active research area motivated by uncertainty quantification, model calibration, and AI safety. In medical statistics the FDA has accepted Bayesian designs for confirmatory trials in selected settings, and adaptive Bayesian designs are routine in oncology platform trials. In the debate around the reproducibility crisis, prominent statisticians have argued that Bayesian model comparison and posterior reporting should partly replace null hypothesis significance testing.

None of this means the frequentist tradition has been displaced. Most introductory statistics courses still teach the Neyman-Pearson framework, and most published p-values come from frequentist procedures. What has changed is that Bayesian methods are now a serious alternative rather than a fringe view, with mature software and large user communities.

## References

- Bayes, T. (1763). "An Essay towards solving a Problem in the Doctrine of Chances." *Philosophical Transactions of the Royal Society of London*, 53, 370-418. Communicated posthumously by Richard Price.
- Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., & Goodman, N. D. (2018). "Pyro: Deep Universal Probabilistic Programming." *Journal of Machine Learning Research*.
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
- Box, G. E. P., & Tiao, G. C. (1973). *Bayesian Inference in Statistical Analysis*. Addison-Wesley.
- Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). "Stan: A Probabilistic Programming Language." *Journal of Statistical Software*, 76(1).
- de Finetti, B. (1937). "La prévision: ses lois logiques, ses sources subjectives." *Annales de l'Institut Henri Poincaré*, 7, 1-68. English translation: "Foresight: Its Logical Laws, Its Subjective Sources."
- Gal, Y., & Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." *Proceedings of the 33rd International Conference on Machine Learning*.
- Gelfand, A. E., & Smith, A. F. M. (1990). "Sampling-Based Approaches to Calculating Marginal Densities." *Journal of the American Statistical Association*, 85(410).
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). *Bayesian Data Analysis*, 3rd edition. Chapman & Hall / CRC Press.
- Geman, S., & Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, PAMI-6(6).
- Hastings, W. K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." *Biometrika*, 57(1).
- Hoffman, M. D., & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." *Journal of Machine Learning Research*, 15.
- Jaynes, E. T. (2003). *Probability Theory: The Logic of Science*. Cambridge University Press. Edited by G. Larry Bretthorst.
- Laplace, P.-S. (1812). *Théorie analytique des probabilités*. Courcier.
- Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., & Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning." *Advances in Neural Information Processing Systems*.
- McElreath, R. (2020). *Statistical Rethinking: A Bayesian Course with Examples in R and Stan*, 2nd edition. Chapman & Hall / CRC Press.
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines." *Journal of Chemical Physics*, 21(6).
- Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press.
- Phan, D., Pradhan, N., & Jankowiak, M. (2019). "Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro." arXiv preprint.
- Robert, C. P. (2007). *The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation*, 2nd edition. Springer.
- Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). "Probabilistic Programming in Python Using PyMC3." *PeerJ Computer Science*, 2.
- Savage, L. J. (1954). *The Foundations of Statistics*. John Wiley & Sons.

