Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as new evidence or data becomes available. It treats unknown parameters as random variables described by probability distributions, combines prior beliefs with observed data through a likelihood function, and produces a posterior distribution that quantifies the remaining uncertainty about those parameters. Bayesian inference forms the conceptual backbone of large parts of modern machine learning, probabilistic modeling, decision theory, and scientific data analysis. It powers techniques ranging from spam filters and recommendation systems to Bayesian optimization of large language model hyperparameters and uncertainty quantification in safety-critical AI systems.
The approach contrasts with classical, or frequentist, inference, which treats parameters as fixed unknown constants and reasons about the long-run behavior of estimators across many hypothetical repetitions of an experiment. In the Bayesian view, probability is a measure of degree of belief, not a frequency. This interpretive shift has practical consequences: Bayesian methods naturally handle small samples, sequential data arrival, hierarchical structure, and the propagation of uncertainty through complex models. The framework has been formalized over more than two centuries, beginning with Thomas Bayes in the eighteenth century, generalized by Pierre-Simon Laplace, and revived in the twentieth century through the work of Harold Jeffreys, Dennis Lindley, Bruno de Finetti, Edwin Jaynes, and Judea Pearl.
The basic statement of Bayesian inference is given by Bayes' rule:
P(theta | D) = P(D | theta) * P(theta) / P(D)
Here theta represents the unknown parameter or hypothesis, and D represents the observed data. The four quantities have specific names that recur throughout the Bayesian literature.
| Symbol | Name | Meaning |
|---|---|---|
| P(theta) | Prior | Belief about theta before seeing the data |
| P(D \| theta) | Likelihood | Probability of the data given a specific value of theta |
| P(theta \| D) | Posterior | Updated belief about theta after observing the data |
| P(D) | Evidence (marginal likelihood) | Probability of the data, averaged over all values of theta |
The posterior is proportional to the prior multiplied by the likelihood. The evidence in the denominator acts as a normalizing constant that ensures the posterior integrates to one. In many problems the evidence is intractable because it requires integration over the entire parameter space, and most computational machinery in modern Bayesian work exists to circumvent this integral.
The prior distribution encodes what is known or assumed about the parameter before any data has been observed. Priors can be informative, drawing on previous studies, expert knowledge, or theoretical constraints, or they can be weakly informative or non-informative when the analyst wants the data to dominate. Common forms include flat (uniform) priors over a bounded range, conjugate priors chosen for mathematical convenience, hierarchical priors that share information across groups, and weakly regularizing priors such as the half-normal or half-Cauchy distributions that Andrew Gelman has popularized for variance parameters. The choice of prior is one of the most debated aspects of Bayesian methodology and is often the focal point of objections from frequentist statisticians.
The likelihood function is the same object used in classical maximum likelihood estimation. It expresses the probability of the observed data as a function of the unknown parameter. In Bayesian inference the likelihood is multiplied by the prior rather than maximized directly. The likelihood principle, which states that all evidence about theta from data D is contained in the likelihood function, is honored by Bayesian inference but violated by frequentist procedures whose conclusions depend on the sampling plan, such as hypothesis tests whose p-values change with the stopping rule.
The posterior distribution is the central object of Bayesian inference. From it one can extract point estimates such as the posterior mean, median, or maximum a posteriori (MAP) value, interval estimates such as credible intervals, predictive distributions for new observations, decision-theoretic actions that minimize expected loss, and model comparisons via Bayes factors or marginal likelihoods. Unlike a frequentist confidence interval, a 95% Bayesian credible interval has the direct interpretation that there is a 95% probability the parameter lies in the interval given the observed data and the chosen prior.
The evidence, also called the marginal likelihood, is the probability of the observed data integrated over all possible parameter values weighted by the prior. It plays two roles. As a normalizer it makes the posterior a proper probability distribution. As a model comparison tool it provides the basis for the Bayes factor, which compares two competing models by the ratio of their evidences. The evidence rewards models that fit the data well while penalizing unnecessary complexity, an automatic Occam's razor effect that does not require an explicit complexity term.
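The Occam effect can be made concrete with a small calculation. For seven heads in ten flips, the sketch below compares the evidence of a fair-coin model against a model that places a uniform prior on theta (the data counts are illustrative):

```python
from math import comb, factorial

k, n = 7, 10  # observed: 7 heads in 10 flips

# Model 1: the coin is exactly fair, theta = 0.5.
evidence_fair = comb(n, k) * 0.5 ** n

# Model 2: theta unknown, uniform prior on [0, 1].
# Integral of C(n,k) * theta^k * (1-theta)^(n-k) over [0, 1]
# equals C(n,k) * B(k+1, n-k+1) = 1 / (n + 1).
evidence_uniform = comb(n, k) * factorial(k) * factorial(n - k) / factorial(n + 1)

bayes_factor = evidence_fair / evidence_uniform
print(round(evidence_uniform, 6))  # 0.090909, i.e. 1/11
print(round(bayes_factor, 4))      # 1.2891
```

Even though 7/10 heads looks like weak evidence against fairness, the Bayes factor mildly favors the fair-coin model: the flexible model spreads its prior probability over all values of theta and pays for that flexibility in the evidence.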
A standard introduction is the coin flipping problem. Suppose a coin has unknown probability theta of landing heads, and the analyst observes seven heads in ten flips. With a uniform prior on theta over [0, 1], the posterior is a Beta(8, 4) distribution, with mean 8 / 12, or about 0.67. The maximum likelihood estimate would be 0.7. The two answers are similar because the uniform prior carries little information, but if the analyst had used a Beta(50, 50) prior reflecting strong prior belief that the coin is fair, the posterior would shift back toward 0.5. As more flips accumulate the prior matters less and the posterior concentrates ever more tightly around the true value of theta; its asymptotically normal shape is described by the Bernstein-von Mises theorem.
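The conjugate update in this example is a one-liner; the sketch below also draws a Monte Carlo credible interval from the Beta(8, 4) posterior using only the standard library:

```python
import random

random.seed(0)
alpha0, beta0 = 1, 1          # uniform Beta(1, 1) prior
heads, flips = 7, 10

# Conjugate update: Beta(alpha0 + heads, beta0 + tails)
a, b = alpha0 + heads, beta0 + (flips - heads)

post_mean = a / (a + b)            # 8/12 ≈ 0.667
map_est = (a - 1) / (a + b - 2)    # mode of Beta(8, 4) = 0.7, matching the MLE
mle = heads / flips                # 0.7

# Monte Carlo 95% credible interval from posterior draws
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]
print(round(post_mean, 3), round(map_est, 2))  # 0.667 0.7
print(round(lo, 3), round(hi, 3))
```

Under a uniform prior the MAP estimate coincides with the maximum likelihood estimate, while the posterior mean is pulled slightly toward the prior mean of 0.5.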
The theorem at the core of Bayesian inference was developed by the English Presbyterian minister Thomas Bayes and published posthumously in 1763 by his friend Richard Price under the title "An Essay towards solving a Problem in the Doctrine of Chances." Bayes himself worked out the special case of a uniform prior on a Bernoulli probability. The general formulation, including the use of arbitrary priors and the treatment of multidimensional problems, was developed independently and far more comprehensively by Pierre-Simon Laplace, who applied it to celestial mechanics, demography, and jurisprudence between roughly 1774 and 1814. Laplace's 1814 "Essai philosophique sur les probabilites" is often regarded as the first explicit Bayesian methodological text.
Throughout the nineteenth and early twentieth centuries Bayesian methods were practiced under various names but came under sharp criticism from Ronald Fisher, Jerzy Neyman, and Egon Pearson, the architects of frequentist inference. Fisher's introduction of the p-value, fiducial inference, and the maximum likelihood method displaced Bayesian techniques in much of mainstream statistics by the 1930s. The Bayesian school survived through the writings of Harold Jeffreys, whose 1939 book "Theory of Probability" provided a systematic treatment of objective priors, and through Bruno de Finetti's foundational work on subjective probability and exchangeability.
A modern revival began in the 1950s and 1960s. Leonard Savage's 1954 "The Foundations of Statistics" rebuilt decision theory on Bayesian foundations. Dennis Lindley championed Bayesian methods at British universities and helped train a generation of practitioners. Edwin Jaynes, working in physics, advocated maximum entropy priors and a logical interpretation of probability. The publication of Judea Pearl's "Probabilistic Reasoning in Intelligent Systems" in 1988 brought Bayesian networks to artificial intelligence and laid the groundwork for the graphical model revolution.
The practical revolution came with computation. Adrian Smith and his collaborators in the early 1990s, building on the Geman brothers' 1984 paper on Gibbs sampling, demonstrated that Markov chain Monte Carlo could make complex Bayesian models routinely tractable. The release of BUGS in 1989 and WinBUGS in 1997 democratized MCMC. Stan, released in 2012 and built on Hamiltonian Monte Carlo with the No-U-Turn Sampler, brought Bayesian inference to a much wider audience. By the 2010s Bayesian methods were standard in genetics, ecology, astronomy, epidemiology, and increasingly in industrial machine learning.
The long-running debate between frequentist and Bayesian statisticians is partly philosophical and partly practical. The two schools agree on most numerical answers in simple problems with weak priors but diverge on interpretation, methodology in complex problems, and treatment of nuisance parameters.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability interpretation | Long-run frequency in repeated trials | Degree of belief or rational uncertainty |
| Parameters | Fixed unknown constants | Random variables with distributions |
| Inference output | Point estimates, p-values, confidence intervals | Posterior distributions, credible intervals |
| Prior information | Generally not used explicitly | Encoded in prior distribution |
| Sequential analysis | Requires special procedures | Natural and coherent |
| Hierarchical models | Difficult or ad hoc | Natural via hyperpriors |
| Small samples | Often unreliable | Coherent if priors are reasonable |
| Computational cost | Usually low | Often high without specialized methods |
| Subjectivity | Hidden in choice of test, stopping rule, model | Made explicit in the prior |
A 95% confidence interval has a definition that is awkward to convey: if the experiment were repeated infinitely many times, 95% of the constructed intervals would contain the true parameter. A 95% credible interval has the direct interpretation that the parameter is in the interval with probability 0.95 given the data and prior. Most working scientists, when pressed, treat their confidence intervals as if they were credible intervals, which is one of the practical arguments for the Bayesian framework.
The debate has cooled since the 2000s. Many statisticians now describe themselves as pragmatic, choosing methods to fit problems rather than ideology. The deep learning era has seen a partial Bayesian revival because uncertainty quantification matters in autonomous driving, medical AI, and reinforcement learning, and frequentist tools struggle to provide it for over-parameterized neural networks.
Conjugate priors are pairs of prior and likelihood distributions chosen so that the posterior remains in the same family as the prior. They make Bayesian updating analytically tractable and were the workhorse of pre-computational Bayesian work. Three classical conjugate pairs continue to appear constantly in modern applications.
If the likelihood is binomial, modeling the number of successes in a fixed number of independent trials with success probability theta, then the conjugate prior is the Beta distribution. With prior Beta(alpha, beta) and observed data of k successes in n trials, the posterior is Beta(alpha + k, beta + n - k). The Beta-Binomial pair underlies Bayesian A/B testing, click-through rate estimation, multi-armed bandit algorithms such as Thompson sampling, and conversion rate analysis in product analytics.
For count data modeled with a Poisson likelihood with rate parameter lambda, the conjugate prior is the Gamma distribution. With prior Gamma(alpha, beta) and observed count sum s over n observations, the posterior is Gamma(alpha + s, beta + n). The Gamma-Poisson pair appears in queueing analysis, insurance claim modeling, defect counting, citation analysis, and any application that counts independent events occurring at an unknown rate.
For a normal likelihood with known variance and unknown mean, the conjugate prior on the mean is also a normal distribution. The posterior mean is a precision-weighted average of the prior mean and the data mean, and the posterior precision is the sum of the prior precision and the data precision. This pair generalizes to the multivariate normal and to the normal-inverse-gamma when both mean and variance are unknown. It is the building block of Kalman filters and Bayesian linear regression.
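A minimal sketch of the precision-weighted update, with illustrative numbers (prior Normal(0, 1), known noise variance 1):

```python
def normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior for the mean of a normal likelihood with known variance."""
    n = len(data)
    prior_prec = 1.0 / prior_var
    data_prec = n / noise_var               # precision contributed by the data
    post_prec = prior_prec + data_prec      # precisions add
    post_mean = (prior_prec * prior_mean
                 + data_prec * (sum(data) / n)) / post_prec
    return post_mean, 1.0 / post_prec       # posterior mean and variance

m, v = normal_update(0.0, 1.0, [2.0, 1.0, 3.0], 1.0)
print(m, v)  # 1.5 0.25
```

With three observations averaging 2.0, the data precision (3) outweighs the prior precision (1), so the posterior mean of 1.5 sits three quarters of the way from the prior mean to the sample mean.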
Conjugate priors lose their dominance once models become hierarchical, nonlinear, or high-dimensional, because the analytical convenience disappears. They survive as components inside larger Gibbs samplers, where each conditional distribution may still be conjugate even when the joint posterior is not.
Most interesting Bayesian models lack closed-form posteriors. The evidence integral over a high-dimensional parameter space cannot be computed by hand, so practitioners turn to one of several broad approximation strategies, chief among them Markov chain Monte Carlo, variational inference, and likelihood-free methods such as Approximate Bayesian Computation. Each strategy has subvariants tuned for specific kinds of model structure.
| Method family | Representative algorithms | Strengths | Weaknesses |
|---|---|---|---|
| MCMC | Metropolis-Hastings, Gibbs, HMC, NUTS, slice sampling | Asymptotically exact, well understood, many diagnostics | Slow, hard to scale, autocorrelated samples |
| Variational inference | Mean-field VI, ADVI, normalizing flows, BBVI | Fast, scales to large data, deterministic optimization | Biased, underestimates variance, model-dependent quality |
| Likelihood-free | ABC rejection, ABC-MCMC, ABC-SMC, simulation-based inference with neural networks | Works when likelihood is intractable | Sensitive to summary statistics and tolerance |
| Sequential Monte Carlo | Particle filters, SMC samplers | Handles dynamic systems and tempered targets | Particle degeneracy, tuning difficulty |
| Laplace approximation | Gaussian fit at posterior mode | Very fast, useful baseline | Poor for skewed or multimodal posteriors |
Markov chain Monte Carlo constructs a Markov chain whose stationary distribution is the target posterior. Sampling the chain long enough yields draws that, although correlated, can be used to estimate posterior expectations, quantiles, and credible intervals.
The Metropolis-Hastings algorithm, introduced by Nicholas Metropolis and colleagues in 1953 and generalized by W. K. Hastings in 1970, proposes new states from a chosen distribution and accepts or rejects them based on a ratio that depends only on the unnormalized posterior. Gibbs sampling, popularized for image processing by Stuart and Donald Geman in 1984 and for general statistical models by Gelfand and Smith in 1990, updates one parameter at a time by sampling from its full conditional distribution. Gibbs sampling is especially powerful when conditional distributions are conjugate.
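A random-walk Metropolis sampler fits in a few lines. The sketch below targets the unnormalized Beta(8, 4) posterior from the coin example, so the estimated posterior mean can be checked against the analytic value of 8/12:

```python
import math
import random

random.seed(1)

def log_post(theta, heads=7, tails=3):
    # Unnormalized log posterior: uniform prior times binomial likelihood
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(log_target, init=0.5, step=0.2, n_samples=50_000):
    theta, samples = init, []
    lp = log_target(theta)
    for _ in range(n_samples):
        prop = theta + random.gauss(0.0, step)       # symmetric random-walk proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, posterior ratio); the normalizer cancels
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples[n_samples // 5:]                  # discard 20% as burn-in

draws = metropolis(log_post)
print(round(sum(draws) / len(draws), 3))  # close to the analytic mean 8/12 ≈ 0.667
```

Note that only the unnormalized posterior appears in the acceptance ratio: the intractable evidence cancels, which is the property that makes MCMC practical.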
Hamiltonian Monte Carlo, originally introduced as Hybrid Monte Carlo by Duane, Kennedy, Pendleton, and Roweth in 1987 for lattice quantum chromodynamics, treats parameters as positions and introduces auxiliary momentum variables. It uses the gradient of the log posterior to propose long, low-rejection trajectories that explore the posterior more efficiently than random-walk Metropolis. The No-U-Turn Sampler, NUTS, introduced by Matthew Hoffman and Andrew Gelman in 2014, automatically tunes the trajectory length, removing one of the major hand-tuning burdens of HMC. NUTS is the default sampler in Stan and PyMC.
Diagnostics for MCMC convergence include trace plots, the Gelman-Rubin R-hat statistic, effective sample size, energy diagnostics for HMC, and posterior predictive checks. A chain that fails these diagnostics produces samples that may not represent the true posterior, and reporting results from unconverged chains is a recurring source of error in applied Bayesian work.
Variational inference recasts posterior approximation as an optimization problem. The analyst chooses a family of tractable distributions q(theta), then minimizes the Kullback-Leibler divergence from q to the true posterior p(theta | D). Because the divergence cannot be computed without the intractable evidence, the equivalent maximization of the evidence lower bound, ELBO, is performed instead.
Mean-field variational inference assumes the posterior factorizes across parameters, which makes the optimization tractable but tends to underestimate posterior correlations and variance. Black-box variational inference, introduced by Rajesh Ranganath, Sean Gerrish, and David Blei in 2014, uses Monte Carlo estimates of the ELBO gradient and works for any differentiable model. Automatic Differentiation Variational Inference, ADVI, implemented in Stan by Alp Kucukelbir and colleagues in 2017, transforms constrained parameters to unconstrained space and fits a Gaussian or full-rank multivariate Gaussian variational family. Normalizing flows, introduced by Danilo Rezende and Shakir Mohamed in 2015, parameterize flexible variational distributions through sequences of invertible neural network transformations.
Variational inference dominates large-scale and online Bayesian work because it converts an integration problem into a stochastic optimization problem solvable with the same gradient descent infrastructure used to train neural networks. Stochastic variational inference, introduced by Matthew Hoffman, David Blei, Chong Wang, and John Paisley in 2013, scales to massive datasets by subsampling minibatches.
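A minimal reparameterization-trick sketch for a model whose exact posterior is known, so the variational fit can be checked: x_i ~ Normal(theta, 1) with prior theta ~ Normal(0, 1), and a Gaussian variational family q = Normal(mu, sigma). The data values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact posterior is Normal(sum(x) / (n + 1), 1 / (n + 1)).
x = np.array([1.8, 2.2, 1.5, 2.5, 2.0])
n, sx = len(x), x.sum()

mu, log_sigma = 0.0, 0.0          # variational parameters of q = Normal(mu, sigma)
lr = 0.01
for _ in range(20_000):
    eps = rng.standard_normal(8)              # 8 reparameterized draws per step
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps                  # theta = mu + sigma * eps
    g = sx - (n + 1) * theta                  # d log p(x, theta) / d theta
    grad_mu = g.mean()
    grad_log_sigma = (g * sigma * eps).mean() + 1.0   # +1 from the entropy of q
    mu += lr * grad_mu                        # stochastic gradient ascent on the ELBO
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # should approach sum(x)/(n+1) ≈ 1.667 and 1/sqrt(n+1) ≈ 0.408
```

Because the model is conjugate, the Gaussian family contains the true posterior and the optimizer can recover it exactly; in realistic models the same machinery yields the best Gaussian (or flow-based) approximation instead.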
Approximate Bayesian Computation, ABC, handles models where the likelihood function is intractable but the model can be simulated. The basic recipe samples a parameter from the prior, simulates a dataset, and accepts the parameter if a chosen distance between simulated and observed summary statistics falls below a tolerance. ABC originated in population genetics in the 1980s and 1990s and is now widely used in epidemiology, ecology, cosmology, and any field with rich simulators but no analytic likelihoods. Modern variants include ABC-MCMC, ABC sequential Monte Carlo, and neural simulation-based inference methods such as masked autoregressive flows for density estimation. The 2020 paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe surveys these neural approaches in detail.
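The basic rejection recipe is short enough to show in full. The sketch below pretends the binomial likelihood is unavailable, simulates coin flips instead, and accepts a prior draw only when the simulated head count matches exactly (tolerance zero), recovering the Beta(8, 4) posterior of the earlier coin example:

```python
import random

random.seed(2)

observed_heads, flips = 7, 10

accepted = []
for _ in range(200_000):
    theta = random.random()                  # draw a candidate from the uniform prior
    # Forward-simulate the data instead of evaluating a likelihood
    sim = sum(random.random() < theta for _ in range(flips))
    if sim == observed_heads:                # exact match, i.e. tolerance zero
        accepted.append(theta)

abc_mean = sum(accepted) / len(accepted)
print(len(accepted), round(abc_mean, 3))     # posterior mean ≈ 8/12 ≈ 0.667
```

For discrete data an exact match is feasible; for continuous data one instead accepts when a distance between summary statistics falls below a tolerance, which is where the approximation error of ABC enters.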
A probabilistic programming language is a programming language whose primitives include random variables and observed data. It allows the user to write a generative model in code, then automatically performs inference using one or more of the methods above. Probabilistic programming has been the single biggest accelerant for applied Bayesian work in the last fifteen years.
| Library | Year | Backend | Inference engines | Notable features |
|---|---|---|---|---|
| BUGS / WinBUGS / OpenBUGS | 1989 / 1997 / 2005 | Custom | Gibbs, Metropolis | First widely used PPL, declarative model syntax |
| JAGS | 2007 | C++ | Gibbs, slice, Metropolis | Cross-platform BUGS successor |
| Stan | 2012 | C++ with autodiff | NUTS, ADVI, optimization | High-performance HMC, R/Python/Julia/Shell interfaces |
| PyMC | 2003 onward (PyMC3 in 2016, PyMC v5 in 2023) | Python on PyTensor | NUTS, ADVI, SMC | Pythonic API, GPU support via JAX backend |
| Edward | 2016 | TensorFlow | HMC, VI, GAN inference | Predecessor of TensorFlow Probability |
| TensorFlow Probability | 2018 | TensorFlow | HMC, NUTS, VI | Production-grade, used at Google |
| Pyro | 2017 | PyTorch | SVI, HMC, NUTS | Designed for deep generative models |
| NumPyro | 2019 | JAX | NUTS, SVI, SA | Extremely fast NUTS via JIT compilation |
| Turing.jl | 2018 | Julia | HMC, NUTS, particle Gibbs | Native Julia composability |
| Gen | 2019 | Julia | Programmable inference | Fine-grained control over inference |
These libraries lower the cost of trying a Bayesian approach so dramatically that many problems once handled with frequentist defaults are now tackled probabilistically. NumPyro in particular has changed expectations for inference speed by exploiting JAX's just-in-time compilation and vectorization, allowing models with thousands of parameters to be sampled in seconds rather than hours.
Bayesian inference appears throughout modern artificial intelligence, sometimes as the primary technique and sometimes as a regularization principle borrowed by otherwise non-Bayesian methods.
Bayesian deep learning treats neural network weights as random variables and aims to compute or approximate the posterior over weights given the training data. The exact posterior is intractable for any nontrivial network because the parameter count runs from millions to trillions, so the field is dominated by approximations.
Monte Carlo dropout, introduced by Yarin Gal and Zoubin Ghahramani in 2016, observes that applying dropout at inference time in a network trained with dropout regularization is mathematically equivalent to approximate variational inference in a deep Gaussian process. Running the network many times with different dropout masks yields a sample from an approximate predictive distribution, providing a cheap uncertainty estimate without changing the training pipeline.
Weight uncertainty methods, of which Bayes by Backprop by Charles Blundell and colleagues in 2015 is the canonical example, parameterize each weight by a mean and standard deviation and learn both via stochastic variational inference. The result is a network that samples its own weights at every forward pass.
The Laplace approximation, revived for deep networks by Hippolyt Ritter, Aleksandar Botev, and David Barber in 2018 and packaged in libraries such as Laplace Redux and laplace-torch, fits a Gaussian to the posterior at the trained weights using the Hessian or Fisher information matrix. It is essentially free given an already-trained network and provides surprisingly good calibration for many tasks.
Deep ensembles, introduced by Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell in 2017, train multiple networks with different random initializations and average their predictions. Ensembles are not strictly Bayesian but have been shown by Andrew Wilson and Pavel Izmailov in 2020 to approximate a posterior over the loss landscape, and they often outperform formal Bayesian methods in practice.
Bayesian neural networks, BNNs, are the broader class of neural network models with prior distributions over weights. The phrase is sometimes used interchangeably with Bayesian deep learning. Recent surveys by Andrew Wilson, Eric Nalisnick, Yingzhen Li, and others document the trade-offs and open problems, including prior choice for over-parameterized models, the cold posterior effect first reported by Florian Wenzel and colleagues in 2020, and the relationship between SGD with weight noise and Bayesian inference.
Bayesian optimization is a derivative-free strategy for optimizing expensive black-box functions. It places a probabilistic surrogate model over the objective, usually a Gaussian process, and chooses the next evaluation point by maximizing an acquisition function such as expected improvement, upper confidence bound, or knowledge gradient. Each evaluation updates the surrogate, narrowing the posterior over plausible function shapes.
The approach is dominant for hyperparameter tuning of machine learning models. Libraries such as Spearmint, GPyOpt, scikit-optimize, BayesSearchCV, BoTorch (built on PyTorch by Meta), and Ax (also from Meta) implement Bayesian optimization for problems ranging from neural network architecture search to industrial chemical process optimization. Optuna, while based primarily on the tree-structured Parzen estimator rather than Gaussian processes, sits in the same family of sequential model-based optimization methods. Bayesian optimization has been used to tune large language model training recipes, set sampling parameters for diffusion models, and configure expensive scientific simulations.
Classical grid search and random search make no use of the information from previous trials. Bayesian optimization treats the validation loss as a draw from a probabilistic surrogate, so each trial reduces uncertainty about which regions of the configuration space deserve further exploration. The cost-benefit trade-off favors Bayesian optimization when each trial is expensive (training a model takes hours or days) and the configuration space is small to medium. Random search remains competitive when configurations are cheap and the search dimension is high.
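The loop above can be sketched with a Gaussian process surrogate and expected improvement on a toy one-dimensional objective; the objective, kernel lengthscale, and grid below are all illustrative choices, not a production implementation:

```python
import math
import numpy as np

def f(x):                       # toy "expensive" objective, maximum at x = 0.3
    return -(x - 0.3) ** 2

def rbf(a, b, ls=0.15):         # squared-exponential kernel, unit signal variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(xt, yt, xs, jitter=1e-6):
    K = rbf(xt, xt) + jitter * np.eye(len(xt))
    Ks = rbf(xt, xs)
    sol = np.linalg.solve(K, Ks)
    mean = sol.T @ yt
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)  # k(x, x) = 1
    return mean, np.sqrt(var)

def expected_improvement(mean, std, best):
    z = (mean - best) / std
    Phi = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mean - best) * Phi + std * phi

grid = np.linspace(0.0, 1.0, 201)
x_obs = np.array([0.05, 0.95])            # two initial evaluations
y_obs = f(x_obs)

for _ in range(10):                       # ten sequential evaluations
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mean, std, y_obs.max())
    x_next = grid[int(np.argmax(ei))]     # evaluate where EI is largest
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

best_x = float(x_obs[np.argmax(y_obs)])
print(best_x)                             # near the true optimum at 0.3
```

The zero-mean GP is optimistic here because the objective is nonpositive, which encourages exploration; real libraries standardize the observations and learn the kernel hyperparameters rather than fixing them.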
Bayesian networks and Markov random fields are graphical representations of joint distributions. The graph structure encodes conditional independence relations that make inference tractable. Pearl's 1988 book established the message-passing algorithms, particularly belief propagation, that exploit these structures. Hidden Markov models, latent Dirichlet allocation for topic modeling, and Kalman filters are all instances of the broader graphical model framework.
Many generative models in AI are explicit Bayesian latent variable models. Latent Dirichlet allocation, introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, models documents as mixtures over topics with Dirichlet priors. The variational autoencoder, introduced by Diederik Kingma and Max Welling in 2013, is a deep latent variable model trained with amortized variational inference. Even diffusion models can be framed as deeply hierarchical latent variable models with a Gaussian prior at each step.
Judea Pearl's structural causal models and the do-calculus extend Bayesian networks to encode causal rather than purely associational relationships. Bayesian methods are widely used in causal inference for handling unmeasured confounders, propagating uncertainty through causal effect estimates, and combining experimental with observational data in hierarchical models. The PyMC and Stan ecosystems both have well-developed causal inference workflows.
Bayesian A/B testing replaces frequentist null hypothesis significance testing with direct posterior statements such as "there is a 96% probability that variant B has a higher conversion rate than variant A." Beta-Binomial conjugacy makes the math trivial, sequential analysis becomes legitimate without alpha-spending corrections, and decisions can be framed in expected loss terms. Companies including Google, Microsoft, Facebook, Booking, and Netflix have published detailed accounts of their experimentation platforms, many of which mix frequentist and Bayesian elements.
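With Beta posteriors, the probability that B beats A reduces to comparing posterior draws. A sketch with hypothetical conversion counts:

```python
import random

random.seed(4)

# Hypothetical observed conversions: A converts 120/2400, B converts 150/2400
a_conv, a_n = 120, 2400
b_conv, b_n = 150, 2400

def posterior_draw(conv, n):
    # Beta(1, 1) prior + binomial likelihood -> Beta(1 + conv, 1 + n - conv)
    return random.betavariate(1 + conv, 1 + n - conv)

wins = sum(posterior_draw(b_conv, b_n) > posterior_draw(a_conv, a_n)
           for _ in range(100_000))
prob = wins / 100_000
print(prob)   # P(B beats A), roughly 0.97 for these counts
```

The same posterior draws can be reused to compute expected loss from shipping the wrong variant, which is the quantity many Bayesian experimentation platforms use for stopping decisions.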
Thompson sampling, introduced by William R. Thompson in 1933 and rediscovered as a strong heuristic for multi-armed bandits in the 2010s, is a Bayesian strategy that samples actions according to the posterior probability that each is optimal. It performs competitively with upper confidence bound algorithms while being simpler to implement and naturally balancing exploration against exploitation. Bayesian reinforcement learning more broadly maintains posteriors over transition dynamics and reward functions, enabling principled exploration in model-based methods.
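A minimal Beta-Bernoulli Thompson sampling sketch, with hypothetical arm success rates:

```python
import random

random.seed(5)

true_rates = [0.04, 0.12, 0.07]   # hidden Bernoulli arms; arm 1 is truly best
alpha = [1, 1, 1]                 # Beta(1, 1) posterior parameters per arm
beta = [1, 1, 1]
pulls = [0, 0, 0]

for _ in range(20_000):
    # Sample one plausible rate per arm from its posterior, then play the argmax
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = samples.index(max(samples))
    reward = random.random() < true_rates[arm]
    alpha[arm] += reward          # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)   # the best arm accumulates the large majority of pulls
```

Exploration happens automatically: an under-sampled arm has a wide posterior and occasionally draws a high sample, so it still gets pulled until the evidence against it accumulates.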
Classical Bayesian inference does not directly produce or train large language models, but Bayesian thinking pervades several adjacent areas. In-context learning has been analyzed as approximate posterior inference over latent task variables. Calibration of LLM token probabilities draws on Bayesian decision theory. Bayesian optimization tunes training hyperparameters and reinforcement learning from human feedback policies. Posterior sampling is used in some approaches to LLM uncertainty quantification and active learning for fine-tuning.
Bayesian inference has known difficulties that practitioners must address. The choice of prior is the most discussed concern. Different reasonable priors can lead to different posteriors when data is sparse, and seemingly innocent uniform priors over parameters can become highly informative on transformed scales. Sensitivity analysis, comparing results across reasonable priors, is the standard mitigation but is rarely performed thoroughly.
Computational cost is a second persistent issue. Even with NUTS and modern hardware, sampling from posteriors over millions or billions of parameters remains impractical. Scalability has driven the rise of variational methods, but variational approximations sacrifice accuracy in ways that are sometimes hard to detect.
Model misspecification is a third concern. Bayesian inference is consistent and well-calibrated only if the model includes the data-generating process or a close approximation. When the model is wrong the posterior may concentrate on parameter values that are best in a Kullback-Leibler sense but produce poor predictions. Robust Bayesian methods, model-checking via posterior predictive plots, and approaches such as the safe Bayes framework address parts of this problem.
Scaling Bayesian methods to deep neural networks remains an open research area. The cold posterior effect, where downweighting the likelihood improves test accuracy, has no fully satisfactory explanation and suggests that current priors over deep network weights are mismatched. The relationship between Bayesian model averaging, ensembling, and stochastic gradient descent dynamics is still being unraveled.
The interpretive flexibility that Bayesians value can also become a weakness in adversarial settings. A motivated researcher can shop priors and likelihoods to produce desired conclusions, just as a frequentist can p-hack. Pre-registration, transparent reporting, and standardized prior choices help but do not eliminate the risk.
The Bayesian community has invested heavily in reproducibility tooling. ArviZ, a Python library by Ravin Kumar and colleagues released in 2019, provides standard diagnostics, plots, and model comparison utilities that work across PyMC, Stan, NumPyro, Pyro, and TensorFlow Probability. Stan's Bayesian workflow, articulated in a 2020 paper by Andrew Gelman and twelve coauthors, codifies a sequence of model-building, prior predictive checking, fitting, diagnostics, posterior predictive checking, and model comparison steps that the community recommends as best practice.
Gaussian process libraries including GPy, GPflow, GPyTorch, and BoTorch implement scalable inference for Gaussian process models. GPyTorch's exploitation of structured kernel matrices and Lanczos methods has pushed Gaussian process inference from datasets of thousands to millions of points.
Mixture models and clustering rely on Bayesian inference in many flavors, from finite Gaussian mixture model inference via expectation-maximization with Bayesian regularization, to nonparametric Dirichlet process mixtures that learn the number of components from data. The scikit-learn implementation of Bayesian Gaussian mixtures is a common entry point for Python practitioners.
The canonical modern textbook is Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin's "Bayesian Data Analysis," third edition, published by Chapman and Hall in 2013, which combines theory, computation, and case studies. Christopher Bishop's "Pattern Recognition and Machine Learning," published by Springer in 2006, devotes its tenth chapter to approximate inference and ties Bayesian ideas to machine learning. David MacKay's "Information Theory, Inference, and Learning Algorithms," published by Cambridge University Press in 2003 and freely available online, develops Bayesian inference alongside information theory and includes some of the clearest expositions of Monte Carlo methods.
Kevin Murphy's "Probabilistic Machine Learning: An Introduction" (2022) and "Probabilistic Machine Learning: Advanced Topics" (2023), both from MIT Press, give comprehensive modern treatments. Edwin Jaynes' posthumous "Probability Theory: The Logic of Science" (2003) presents an opinionated logical view of probability. Judea Pearl's "Probabilistic Reasoning in Intelligent Systems" (1988) and "Causality" (2000) bridge Bayesian inference and causal reasoning. Sharon Bertsch McGrayne's popular history "The Theory That Would Not Die" (2011) traces the social history of Bayes' rule from cryptography to medicine.
For modern Bayesian deep learning, the surveys by Yarin Gal ("Uncertainty in Deep Learning," PhD thesis, University of Cambridge, 2016) and the more recent reviews by Andrew Gordon Wilson, Eric Nalisnick, and others provide good entry points. The Stan Reference Manual and the PyMC documentation are excellent practical references for applied work.