Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as new evidence or data becomes available. It treats unknown parameters as random variables described by probability distributions, combines prior beliefs with observed data through a likelihood function, and produces a posterior distribution that quantifies the remaining uncertainty about those parameters. Bayesian inference forms the conceptual backbone of large parts of modern machine learning, probabilistic modeling, decision theory, and scientific data analysis. It powers techniques ranging from spam filters and recommendation systems to Bayesian optimization of large language model hyperparameters and uncertainty quantification in safety-critical AI systems.
The approach contrasts with classical, or frequentist, inference, which treats parameters as fixed unknown constants and reasons about the long-run behavior of estimators across many hypothetical repetitions of an experiment. In the Bayesian view, probability is a measure of degree of belief, not a frequency. This interpretive shift has practical consequences: Bayesian methods naturally handle small samples, sequential data arrival, hierarchical structure, and the propagation of uncertainty through complex models. The framework has been formalized over more than two centuries, beginning with Thomas Bayes in the eighteenth century, generalized by Pierre-Simon Laplace, and revived in the twentieth century through the work of Harold Jeffreys, Dennis Lindley, Bruno de Finetti, Edwin Jaynes, and Judea Pearl.
The basic statement of Bayesian inference is given by Bayes' rule:
P(theta | D) = P(D | theta) * P(theta) / P(D)
Here theta represents the unknown parameter or hypothesis, and D represents the observed data. The four quantities have specific names that recur throughout the Bayesian literature.
| Symbol | Name | Meaning |
|---|---|---|
| P(theta) | Prior | Belief about theta before seeing the data |
| P(D \| theta) | Likelihood | Probability of the data given a specific value of theta |
| P(theta \| D) | Posterior | Updated belief about theta after observing the data |
| P(D) | Evidence (marginal likelihood) | Probability of the data, averaged over all values of theta |
The posterior is proportional to the prior multiplied by the likelihood. The evidence in the denominator acts as a normalizing constant that ensures the posterior integrates to one. In many problems the evidence is intractable because it requires integration over the entire parameter space, and most computational machinery in modern Bayesian work exists to circumvent this integral.
The prior distribution encodes what is known or assumed about the parameter before any data has been observed. Priors can be informative, drawing on previous studies, expert knowledge, or theoretical constraints, or they can be weakly informative or non-informative when the analyst wants the data to dominate. Common forms include flat (uniform) priors over a bounded range, conjugate priors chosen for mathematical convenience, hierarchical priors that share information across groups, and weakly regularizing priors such as the half-normal or half-Cauchy distributions that Andrew Gelman has popularized for variance parameters. The choice of prior is one of the most debated aspects of Bayesian methodology and is often the focal point of objections from frequentist statisticians.
The likelihood function is the same object used in classical maximum likelihood estimation. It expresses the probability of the observed data as a function of the unknown parameter. In Bayesian inference the likelihood is multiplied by the prior rather than maximized directly. The likelihood principle, which states that all evidence about theta from data D is contained in the likelihood function, is honored by Bayesian inference but violated by frequentist procedures whose conclusions depend on the sampling plan, such as hypothesis tests whose p-values change with the stopping rule.
The posterior distribution is the central object of Bayesian inference. From it one can extract point estimates such as the posterior mean, median, or maximum a posteriori (MAP) value, interval estimates such as credible intervals, predictive distributions for new observations, decision-theoretic actions that minimize expected loss, and model comparisons via Bayes factors or marginal likelihoods. Unlike a frequentist confidence interval, a 95% Bayesian credible interval has the direct interpretation that there is a 95% probability the parameter lies in the interval given the observed data and the chosen prior.
The evidence, also called the marginal likelihood, is the probability of the observed data integrated over all possible parameter values weighted by the prior. It plays two roles. As a normalizer it makes the posterior a proper probability distribution. As a model comparison tool it provides the basis for the Bayes factor, which compares two competing models by the ratio of their evidences. The evidence rewards models that fit the data well while penalizing unnecessary complexity, an automatic Occam's razor effect that does not require an explicit complexity term.
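The Occam effect can be made concrete with a small calculation. For seven heads in ten flips, the sketch below compares the evidence of a fair-coin model against a model that places a uniform prior on theta (the data counts are illustrative):

```python
from math import comb, factorial

k, n = 7, 10  # observed: 7 heads in 10 flips

# Model 1: the coin is exactly fair, theta = 0.5.
evidence_fair = comb(n, k) * 0.5 ** n

# Model 2: theta unknown, uniform prior on [0, 1].
# Integral of C(n,k) * theta^k * (1-theta)^(n-k) over [0, 1]
# equals C(n,k) * B(k+1, n-k+1) = 1 / (n + 1).
evidence_uniform = comb(n, k) * factorial(k) * factorial(n - k) / factorial(n + 1)

bayes_factor = evidence_fair / evidence_uniform
print(round(evidence_uniform, 6))  # 0.090909, i.e. 1/11
print(round(bayes_factor, 4))      # 1.2891
```

Even though 7/10 heads looks like weak evidence against fairness, the Bayes factor mildly favors the fair-coin model: the flexible model spreads its prior probability over all values of theta and pays for that flexibility in the evidence.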
A standard introduction is the coin flipping problem. Suppose a coin has unknown probability theta of landing heads, and the analyst observes seven heads in ten flips. With a uniform prior on theta over [0, 1], the posterior is a Beta(8, 4) distribution, with mean 8 / 12, or about 0.67. The maximum likelihood estimate would be 0.7. The two answers are similar because the uniform prior carries little information, but if the analyst had used a Beta(50, 50) prior reflecting strong prior belief that the coin is fair, the posterior would shift back toward 0.5. As more flips accumulate the prior matters less and the posterior concentrates ever more tightly around the true value of theta; its asymptotically normal shape is described by the Bernstein-von Mises theorem.
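The conjugate update in this example is a one-liner; the sketch below also draws a Monte Carlo credible interval from the Beta(8, 4) posterior using only the standard library:

```python
import random

random.seed(0)
alpha0, beta0 = 1, 1          # uniform Beta(1, 1) prior
heads, flips = 7, 10

# Conjugate update: Beta(alpha0 + heads, beta0 + tails)
a, b = alpha0 + heads, beta0 + (flips - heads)

post_mean = a / (a + b)            # 8/12 ≈ 0.667
map_est = (a - 1) / (a + b - 2)    # mode of Beta(8, 4) = 0.7, matching the MLE
mle = heads / flips                # 0.7

# Monte Carlo 95% credible interval from posterior draws
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]
print(round(post_mean, 3), round(map_est, 2))  # 0.667 0.7
print(round(lo, 3), round(hi, 3))
```

Under a uniform prior the MAP estimate coincides with the maximum likelihood estimate, while the posterior mean is pulled slightly toward the prior mean of 0.5.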
The theorem at the core of Bayesian inference was developed by the English Presbyterian minister Thomas Bayes and published posthumously in 1763 by his friend Richard Price under the title "An Essay towards solving a Problem in the Doctrine of Chances." Bayes himself worked out the special case of a uniform prior on a Bernoulli probability. The general formulation, including the use of arbitrary priors and the treatment of multidimensional problems, was developed independently and far more comprehensively by Pierre-Simon Laplace, who applied it to celestial mechanics, demography, and jurisprudence between roughly 1774 and 1814. Laplace's 1814 "Essai philosophique sur les probabilites" is often regarded as the first explicit Bayesian methodological text.
Throughout the nineteenth and early twentieth centuries Bayesian methods were practiced under various names but came under sharp criticism from Ronald Fisher, Jerzy Neyman, and Egon Pearson, the architects of frequentist inference. Fisher's introduction of the p-value, fiducial inference, and the maximum likelihood method displaced Bayesian techniques in much of mainstream statistics by the 1930s. The Bayesian school survived through the writings of Harold Jeffreys, whose 1939 book "Theory of Probability" provided a systematic treatment of objective priors, and through Bruno de Finetti's foundational work on subjective probability and exchangeability.
A modern revival began in the 1950s and 1960s. Leonard Savage's 1954 "The Foundations of Statistics" rebuilt decision theory on Bayesian foundations. Dennis Lindley championed Bayesian methods at British universities and helped train a generation of practitioners. Edwin Jaynes, working in physics, advocated maximum entropy priors and a logical interpretation of probability. The publication of Judea Pearl's "Probabilistic Reasoning in Intelligent Systems" in 1988 brought Bayesian networks to artificial intelligence and laid the groundwork for the graphical model revolution.
The practical revolution came with computation. Adrian Smith and his collaborators in the early 1990s, building on the Geman brothers' 1984 paper on Gibbs sampling, demonstrated that Markov chain Monte Carlo could make complex Bayesian models routinely tractable. The release of BUGS in 1989 and WinBUGS in 1997 democratized MCMC. Stan, released in 2012 and built on Hamiltonian Monte Carlo with the No-U-Turn Sampler, brought Bayesian inference to a much wider audience. By the 2010s Bayesian methods were standard in genetics, ecology, astronomy, epidemiology, and increasingly in industrial machine learning.
The long-running debate between frequentist and Bayesian statisticians is partly philosophical and partly practical. The two schools agree on most numerical answers in simple problems with weak priors but diverge on interpretation, methodology in complex problems, and treatment of nuisance parameters.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability interpretation | Long-run frequency in repeated trials | Degree of belief or rational uncertainty |
| Parameters | Fixed unknown constants | Random variables with distributions |
| Inference output | Point estimates, p-values, confidence intervals | Posterior distributions, credible intervals |
| Prior information | Generally not used explicitly | Encoded in prior distribution |
| Sequential analysis | Requires special procedures | Natural and coherent |
| Hierarchical models | Difficult or ad hoc | Natural via hyperpriors |
| Small samples | Often unreliable | Coherent if priors are reasonable |
| Computational cost | Usually low | Often high without specialized methods |
| Subjectivity | Hidden in choice of test, stopping rule, model | Made explicit in the prior |
A 95% confidence interval has a definition that is awkward to convey: if the experiment were repeated infinitely many times, 95% of the constructed intervals would contain the true parameter. A 95% credible interval has the direct interpretation that the parameter is in the interval with probability 0.95 given the data and prior. Most working scientists, when pressed, treat their confidence intervals as if they were credible intervals, which is one of the practical arguments for the Bayesian framework.
The debate has cooled since the 2000s. Many statisticians now describe themselves as pragmatic, choosing methods to fit problems rather than ideology. The deep learning era has seen a partial Bayesian revival because uncertainty quantification matters in autonomous driving, medical AI, and reinforcement learning, and frequentist tools struggle to provide it for over-parameterized neural networks.
Conjugate priors are pairs of prior and likelihood distributions chosen so that the posterior remains in the same family as the prior. They make Bayesian updating analytically tractable and were the workhorse of pre-computational Bayesian work. Three classical conjugate pairs continue to appear constantly in modern applications.
If the likelihood is binomial, modeling the number of successes in a fixed number of independent trials with success probability theta, then the conjugate prior is the Beta distribution. With prior Beta(alpha, beta) and observed data of k successes in n trials, the posterior is Beta(alpha + k, beta + n - k). The Beta-Binomial pair underlies Bayesian A/B testing, click-through rate estimation, multi-armed bandit algorithms such as Thompson sampling, and conversion rate analysis in product analytics.
For count data modeled with a Poisson likelihood with rate parameter lambda, the conjugate prior is the Gamma distribution. With prior Gamma(alpha, beta) and observed count sum s over n observations, the posterior is Gamma(alpha + s, beta + n). The Gamma-Poisson pair appears in queueing analysis, insurance claim modeling, defect counting, citation analysis, and any application that counts independent events occurring at an unknown rate.
For a normal likelihood with known variance and unknown mean, the conjugate prior on the mean is also a normal distribution. The posterior mean is a precision-weighted average of the prior mean and the data mean, and the posterior precision is the sum of the prior precision and the data precision. This pair generalizes to the multivariate normal and to the normal-inverse-gamma when both mean and variance are unknown. It is the building block of Kalman filters and Bayesian linear regression.
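A minimal sketch of the precision-weighted update, with illustrative numbers (prior Normal(0, 1), known noise variance 1):

```python
def normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior for the mean of a normal likelihood with known variance."""
    n = len(data)
    prior_prec = 1.0 / prior_var
    data_prec = n / noise_var               # precision contributed by the data
    post_prec = prior_prec + data_prec      # precisions add
    post_mean = (prior_prec * prior_mean
                 + data_prec * (sum(data) / n)) / post_prec
    return post_mean, 1.0 / post_prec       # posterior mean and variance

m, v = normal_update(0.0, 1.0, [2.0, 1.0, 3.0], 1.0)
print(m, v)  # 1.5 0.25
```

With three observations averaging 2.0, the data precision (3) outweighs the prior precision (1), so the posterior mean of 1.5 sits three quarters of the way from the prior mean to the sample mean.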
Conjugate priors lose their dominance once models become hierarchical, nonlinear, or high-dimensional, because the analytical convenience disappears. They survive as components inside larger Gibbs samplers, where each conditional distribution may still be conjugate even when the joint posterior is not.
Most interesting Bayesian models lack closed-form posteriors. The evidence integral over a high-dimensional parameter space cannot be computed by hand, so practitioners turn to one of several broad approximation strategies, chief among them Markov chain Monte Carlo, variational inference, and likelihood-free methods such as Approximate Bayesian Computation. Each strategy has subvariants tuned for specific kinds of model structure.
| Method family | Representative algorithms | Strengths | Weaknesses |
|---|---|---|---|
| MCMC | Metropolis-Hastings, Gibbs, HMC, NUTS, slice sampling | Asymptotically exact, well understood, many diagnostics | Slow, hard to scale, autocorrelated samples |
| Variational inference | Mean-field VI, ADVI, normalizing flows, BBVI | Fast, scales to large data, deterministic optimization | Biased, underestimates variance, model-dependent quality |
| Likelihood-free | ABC rejection, ABC-MCMC, ABC-SMC, simulation-based inference with neural networks | Works when likelihood is intractable | Sensitive to summary statistics and tolerance |
| Sequential Monte Carlo | Particle filters, SMC samplers | Handles dynamic systems and tempered targets | Particle degeneracy, tuning difficulty |
| Laplace approximation | Gaussian fit at posterior mode | Very fast, useful baseline | Poor for skewed or multimodal posteriors |
Markov chain Monte Carlo constructs a Markov chain whose stationary distribution is the target posterior. Sampling the chain long enough yields draws that, although correlated, can be used to estimate posterior expectations, quantiles, and credible intervals.
The Metropolis-Hastings algorithm, introduced by Nicholas Metropolis and colleagues in 1953 and generalized by W. K. Hastings in 1970, proposes new states from a chosen distribution and accepts or rejects them based on a ratio that depends only on the unnormalized posterior. Gibbs sampling, popularized for image processing by Stuart and Donald Geman in 1984 and for general statistical models by Gelfand and Smith in 1990, updates one parameter at a time by sampling from its full conditional distribution. Gibbs sampling is especially powerful when conditional distributions are conjugate.
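A random-walk Metropolis sampler fits in a few lines. The sketch below targets the unnormalized Beta(8, 4) posterior from the coin example, so the estimated posterior mean can be checked against the analytic value of 8/12:

```python
import math
import random

random.seed(1)

def log_post(theta, heads=7, tails=3):
    # Unnormalized log posterior: uniform prior times binomial likelihood
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(log_target, init=0.5, step=0.2, n_samples=50_000):
    theta, samples = init, []
    lp = log_target(theta)
    for _ in range(n_samples):
        prop = theta + random.gauss(0.0, step)       # symmetric random-walk proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, posterior ratio); the normalizer cancels
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples[n_samples // 5:]                  # discard 20% as burn-in

draws = metropolis(log_post)
print(round(sum(draws) / len(draws), 3))  # close to the analytic mean 8/12 ≈ 0.667
```

Note that only the unnormalized posterior appears in the acceptance ratio: the intractable evidence cancels, which is the property that makes MCMC practical.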
Hamiltonian Monte Carlo, originally introduced as Hybrid Monte Carlo by Duane, Kennedy, Pendleton, and Roweth in 1987 for lattice quantum chromodynamics, treats parameters as positions and introduces auxiliary momentum variables. It uses the gradient of the log posterior to propose long, low-rejection trajectories that explore the posterior more efficiently than random-walk Metropolis. The No-U-Turn Sampler, NUTS, introduced by Matthew Hoffman and Andrew Gelman in 2014, automatically tunes the trajectory length, removing one of the major hand-tuning burdens of HMC. NUTS is the default sampler in Stan and PyMC.
Diagnostics for MCMC convergence include trace plots, the Gelman-Rubin R-hat statistic, effective sample size, energy diagnostics for HMC, and posterior predictive checks. A chain that fails these diagnostics produces samples that may not represent the true posterior, and reporting results from unconverged chains is a recurring source of error in applied Bayesian work.
Variational inference recasts posterior approximation as an optimization problem. The analyst chooses a family of tractable distributions q(theta), then minimizes the Kullback-Leibler divergence from q to the true posterior p(theta | D). Because the divergence cannot be computed without the intractable evidence, the equivalent maximization of the evidence lower bound, ELBO, is performed instead.
Mean-field variational inference assumes the posterior factorizes across parameters, which makes the optimization tractable but tends to underestimate posterior correlations and variance. Black-box variational inference, introduced by Rajesh Ranganath, Sean Gerrish, and David Blei in 2014, uses Monte Carlo estimates of the ELBO gradient and works for any differentiable model. Automatic Differentiation Variational Inference, ADVI, implemented in Stan by Alp Kucukelbir and colleagues in 2017, transforms constrained parameters to unconstrained space and fits a Gaussian or full-rank multivariate Gaussian variational family. Normalizing flows, introduced by Danilo Rezende and Shakir Mohamed in 2015, parameterize flexible variational distributions through sequences of invertible neural network transformations.
Variational inference dominates large-scale and online Bayesian work because it converts an integration problem into a stochastic optimization problem solvable with the same gradient descent infrastructure used to train neural networks. Stochastic variational inference, introduced by Matthew Hoffman, David Blei, Chong Wang, and John Paisley in 2013, scales to massive datasets by subsampling minibatches.
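A minimal reparameterization-trick sketch for a model whose exact posterior is known, so the variational fit can be checked: x_i ~ Normal(theta, 1) with prior theta ~ Normal(0, 1), and a Gaussian variational family q = Normal(mu, sigma). The data values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact posterior is Normal(sum(x) / (n + 1), 1 / (n + 1)).
x = np.array([1.8, 2.2, 1.5, 2.5, 2.0])
n, sx = len(x), x.sum()

mu, log_sigma = 0.0, 0.0          # variational parameters of q = Normal(mu, sigma)
lr = 0.01
for _ in range(20_000):
    eps = rng.standard_normal(8)              # 8 reparameterized draws per step
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps                  # theta = mu + sigma * eps
    g = sx - (n + 1) * theta                  # d log p(x, theta) / d theta
    grad_mu = g.mean()
    grad_log_sigma = (g * sigma * eps).mean() + 1.0   # +1 from the entropy of q
    mu += lr * grad_mu                        # stochastic gradient ascent on the ELBO
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # should approach sum(x)/(n+1) ≈ 1.667 and 1/sqrt(n+1) ≈ 0.408
```

Because the model is conjugate, the Gaussian family contains the true posterior and the optimizer can recover it exactly; in realistic models the same machinery yields the best Gaussian (or flow-based) approximation instead.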
Approximate Bayesian Computation, ABC, handles models where the likelihood function is intractable but the model can be simulated. The basic recipe samples a parameter from the prior, simulates a dataset, and accepts the parameter if a chosen distance between simulated and observed summary statistics falls below a tolerance. ABC originated in population genetics in the 1980s and 1990s and is now widely used in epidemiology, ecology, cosmology, and any field with rich simulators but no analytic likelihoods. Modern variants include ABC-MCMC, ABC sequential Monte Carlo, and neural simulation-based inference methods such as masked autoregressive flows for density estimation. The 2020 paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe surveys these neural approaches in detail.
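The basic rejection recipe is short enough to show in full. The sketch below pretends the binomial likelihood is unavailable, simulates coin flips instead, and accepts a prior draw only when the simulated head count matches exactly (tolerance zero), recovering the Beta(8, 4) posterior of the earlier coin example:

```python
import random

random.seed(2)

observed_heads, flips = 7, 10

accepted = []
for _ in range(200_000):
    theta = random.random()                  # draw a candidate from the uniform prior
    # Forward-simulate the data instead of evaluating a likelihood
    sim = sum(random.random() < theta for _ in range(flips))
    if sim == observed_heads:                # exact match, i.e. tolerance zero
        accepted.append(theta)

abc_mean = sum(accepted) / len(accepted)
print(len(accepted), round(abc_mean, 3))     # posterior mean ≈ 8/12 ≈ 0.667
```

For discrete data an exact match is feasible; for continuous data one instead accepts when a distance between summary statistics falls below a tolerance, which is where the approximation error of ABC enters.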
A probabilistic programming language is a programming language whose primitives include random variables and observed data. It allows the user to write a generative model in code, then automatically performs inference using one or more of the methods above. Probabilistic programming has been the single biggest accelerant for applied Bayesian work in the last fifteen years.
| Library | Year | Backend | Inference engines | Notable features |
|---|---|---|---|---|
| BUGS / WinBUGS / OpenBUGS | 1989 / 1997 / 2005 | Custom | Gibbs, Metropolis | First widely used PPL, declarative model syntax |
| JAGS | 2007 | C++ | Gibbs, slice, Metropolis | Cross-platform BUGS successor |
| Stan | 2012 | C++ with autodiff | NUTS, ADVI, optimization | High-performance HMC, R/Python/Julia/Shell interfaces |
| PyMC | 2003 onward (PyMC3 in 2016, PyMC v5 in 2023) | Python on PyTensor | NUTS, ADVI, SMC | Pythonic API, GPU support via JAX backend |
| Edward | 2016 | TensorFlow | HMC, VI, GAN inference | Predecessor of TensorFlow Probability |
| TensorFlow Probability | 2018 | TensorFlow | HMC, NUTS, VI | Production-grade, used at Google |
| Pyro | 2017 | PyTorch | SVI, HMC, NUTS | Designed for deep generative models |
| NumPyro | 2019 | JAX | NUTS, SVI, SA | Extremely fast NUTS via JIT compilation |
| Turing.jl | 2018 | Julia | HMC, NUTS, particle Gibbs | Native Julia composability |
| Gen | 2019 | Julia | Programmable inference | Fine-grained control over inference |
These libraries lower the cost of trying a Bayesian approach so dramatically that many problems once handled with frequentist defaults are now tackled probabilistically. NumPyro in particular has changed expectations for inference speed by exploiting JAX's just-in-time compilation and vectorization, allowing models with thousands of parameters to be sampled in seconds rather than hours.
Bayesian inference appears throughout modern artificial intelligence, sometimes as the primary technique and sometimes as a regularization principle borrowed by otherwise non-Bayesian methods.
Bayesian deep learning treats neural network weights as random variables and aims to compute or approximate the posterior over weights given the training data. The exact posterior is intractable for any nontrivial network because the parameter count runs from millions to trillions, so the field is dominated by approximations.
Monte Carlo dropout, introduced by Yarin Gal and Zoubin Ghahramani in 2016, observes that applying dropout at inference time in a network trained with dropout regularization is mathematically equivalent to approximate variational inference in a deep Gaussian process. Running the network many times with different dropout masks yields a sample from an approximate predictive distribution, providing a cheap uncertainty estimate without changing the training pipeline.
Weight uncertainty methods, of which Bayes by Backprop by Charles Blundell and colleagues in 2015 is the canonical example, parameterize each weight by a mean and standard deviation and learn both via stochastic variational inference. The result is a network that samples its own weights at every forward pass.
The Laplace approximation, revived for deep networks by Hippolyt Ritter, Aleksandar Botev, and David Barber in 2018 and packaged in libraries such as Laplace Redux and laplace-torch, fits a Gaussian to the posterior at the trained weights using the Hessian or Fisher information matrix. It is essentially free given an already-trained network and provides surprisingly good calibration for many tasks.
Deep ensembles, introduced by Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell in 2017, train multiple networks with different random initializations and average their predictions. Ensembles are not strictly Bayesian but have been shown by Andrew Wilson and Pavel Izmailov in 2020 to approximate a posterior over the loss landscape, and they often outperform formal Bayesian methods in practice.
Bayesian neural networks, BNNs, are the broader class of neural network models with prior distributions over weights. The phrase is sometimes used interchangeably with Bayesian deep learning. Recent surveys by Andrew Wilson, Eric Nalisnick, Yingzhen Li, and others document the trade-offs and open problems, including prior choice for over-parameterized models, the cold posterior effect first reported by Florian Wenzel and colleagues in 2020, and the relationship between SGD with weight noise and Bayesian inference.
Bayesian optimization is a derivative-free strategy for optimizing expensive black-box functions. It places a probabilistic surrogate model over the objective, usually a Gaussian process, and chooses the next evaluation point by maximizing an acquisition function such as expected improvement, upper confidence bound, or knowledge gradient. Each evaluation updates the surrogate, narrowing the posterior over plausible function shapes.
The approach is dominant for hyperparameter tuning of machine learning models. Libraries such as Spearmint, GPyOpt, scikit-optimize, BayesSearchCV, BoTorch (built on PyTorch by Meta), and Ax (also from Meta) implement Bayesian optimization for problems ranging from neural network architecture search to industrial chemical process optimization. Optuna, while based primarily on the tree-structured Parzen estimator rather than Gaussian processes, sits in the same family of sequential model-based optimization methods. Bayesian optimization has been used to tune large language model training recipes, set sampling parameters for diffusion models, and configure expensive scientific simulations.
Classical grid search and random search make no use of the information from previous trials. Bayesian optimization treats the validation loss as a draw from a probabilistic surrogate, so each trial reduces uncertainty about which regions of the configuration space deserve further exploration. The cost-benefit trade-off favors Bayesian optimization when each trial is expensive (training a model takes hours or days) and the configuration space is small to medium. Random search remains competitive when configurations are cheap and the search dimension is high.
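The loop above can be sketched with a Gaussian process surrogate and expected improvement on a toy one-dimensional objective; the objective, kernel lengthscale, and grid below are all illustrative choices, not a production implementation:

```python
import math
import numpy as np

def f(x):                       # toy "expensive" objective, maximum at x = 0.3
    return -(x - 0.3) ** 2

def rbf(a, b, ls=0.15):         # squared-exponential kernel, unit signal variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(xt, yt, xs, jitter=1e-6):
    K = rbf(xt, xt) + jitter * np.eye(len(xt))
    Ks = rbf(xt, xs)
    sol = np.linalg.solve(K, Ks)
    mean = sol.T @ yt
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)  # k(x, x) = 1
    return mean, np.sqrt(var)

def expected_improvement(mean, std, best):
    z = (mean - best) / std
    Phi = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mean - best) * Phi + std * phi

grid = np.linspace(0.0, 1.0, 201)
x_obs = np.array([0.05, 0.95])            # two initial evaluations
y_obs = f(x_obs)

for _ in range(10):                       # ten sequential evaluations
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mean, std, y_obs.max())
    x_next = grid[int(np.argmax(ei))]     # evaluate where EI is largest
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

best_x = float(x_obs[np.argmax(y_obs)])
print(best_x)                             # near the true optimum at 0.3
```

The zero-mean GP is optimistic here because the objective is nonpositive, which encourages exploration; real libraries standardize the observations and learn the kernel hyperparameters rather than fixing them.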
Bayesian networks and Markov random fields are graphical representations of joint distributions. The graph structure encodes conditional independence relations that make inference tractable. Pearl's 1988 book established the message-passing algorithms, particularly belief propagation, that exploit these structures. Hidden Markov models, latent Dirichlet allocation for topic modeling, and Kalman filters are all instances of the broader graphical model framework.
Many generative models in AI are explicit Bayesian latent variable models. Latent Dirichlet allocation, introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, models documents as mixtures over topics with Dirichlet priors. The variational autoencoder, introduced by Diederik Kingma and Max Welling in 2013, is a deep latent variable model trained with amortized variational inference. Even diffusion models can be framed as deeply hierarchical latent variable models with a Gaussian prior at each step.
Judea Pearl's structural causal models and the do-calculus extend Bayesian networks to encode causal rather than purely associational relationships. Bayesian methods are widely used in causal inference for handling unmeasured confounders, propagating uncertainty through causal effect estimates, and combining experimental with observational data in hierarchical models. The PyMC and Stan ecosystems both have well-developed causal inference workflows.
Bayesian A/B testing replaces frequentist null hypothesis significance testing with direct posterior statements such as "there is a 96% probability that variant B has a higher conversion rate than variant A." Beta-Binomial conjugacy makes the math trivial, sequential analysis becomes legitimate without alpha-spending corrections, and decisions can be framed in expected loss terms. Companies including Google, Microsoft, Facebook, Booking, and Netflix have published detailed accounts of their experimentation platforms, many of which mix frequentist and Bayesian elements.
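With Beta posteriors, the probability that B beats A reduces to comparing posterior draws. A sketch with hypothetical conversion counts:

```python
import random

random.seed(4)

# Hypothetical observed conversions: A converts 120/2400, B converts 150/2400
a_conv, a_n = 120, 2400
b_conv, b_n = 150, 2400

def posterior_draw(conv, n):
    # Beta(1, 1) prior + binomial likelihood -> Beta(1 + conv, 1 + n - conv)
    return random.betavariate(1 + conv, 1 + n - conv)

wins = sum(posterior_draw(b_conv, b_n) > posterior_draw(a_conv, a_n)
           for _ in range(100_000))
prob = wins / 100_000
print(prob)   # P(B beats A), roughly 0.97 for these counts
```

The same posterior draws can be reused to compute expected loss from shipping the wrong variant, which is the quantity many Bayesian experimentation platforms use for stopping decisions.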
Thompson sampling, introduced by William R. Thompson in 1933 and rediscovered as a strong heuristic for multi-armed bandits in the 2010s, is a Bayesian strategy that samples actions according to the posterior probability that each is optimal. It performs competitively with upper confidence bound algorithms while being simpler to implement and naturally balancing exploration against exploitation. Bayesian reinforcement learning more broadly maintains posteriors over transition dynamics and reward functions, enabling principled exploration in model-based methods.
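A minimal Beta-Bernoulli Thompson sampling sketch, with hypothetical arm success rates:

```python
import random

random.seed(5)

true_rates = [0.04, 0.12, 0.07]   # hidden Bernoulli arms; arm 1 is truly best
alpha = [1, 1, 1]                 # Beta(1, 1) posterior parameters per arm
beta = [1, 1, 1]
pulls = [0, 0, 0]

for _ in range(20_000):
    # Sample one plausible rate per arm from its posterior, then play the argmax
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = samples.index(max(samples))
    reward = random.random() < true_rates[arm]
    alpha[arm] += reward          # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)   # the best arm accumulates the large majority of pulls
```

Exploration happens automatically: an under-sampled arm has a wide posterior and occasionally draws a high sample, so it still gets pulled until the evidence against it accumulates.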
Classical Bayesian inference does not directly produce or train large language models, but Bayesian thinking pervades several adjacent areas. In-context learning has been analyzed as approximate posterior inference over latent task variables. Calibration of LLM token probabilities draws on Bayesian decision theory. Bayesian optimization tunes training hyperparameters and reinforcement learning from human feedback policies. Posterior sampling is used in some approaches to LLM uncertainty quantification and active learning for fine-tuning.
Bayesian inference has known difficulties that practitioners must address. The choice of prior is the most discussed concern. Different reasonable priors can lead to different posteriors when data is sparse, and seemingly innocent uniform priors over parameters can become highly informative on transformed scales. Sensitivity analysis, comparing results across reasonable priors, is the standard mitigation but is rarely performed thoroughly.
Computational cost is a second persistent issue. Even with NUTS and modern hardware, sampling from posteriors over millions or billions of parameters remains impractical. Scalability has driven the rise of variational methods, but variational approximations sacrifice accuracy in ways that are sometimes hard to detect.
Model misspecification is a third concern. Bayesian inference is consistent and well-calibrated only if the model includes the data-generating process or a close approximation. When the model is wrong the posterior may concentrate on parameter values that are best in a Kullback-Leibler sense but produce poor predictions. Robust Bayesian methods, model-checking via posterior predictive plots, and approaches such as the safe Bayes framework address parts of this problem.
Scaling Bayesian methods to deep neural networks remains an open research area. The cold posterior effect, where downweighting the likelihood improves test accuracy, has no fully satisfactory explanation and suggests that current priors over deep network weights are mismatched. The relationship between Bayesian model averaging, ensembling, and stochastic gradient descent dynamics is still being unraveled.
The interpretive flexibility that Bayesians value can also become a weakness in adversarial settings. A motivated researcher can shop priors and likelihoods to produce desired conclusions, just as a frequentist can p-hack. Pre-registration, transparent reporting, and standardized prior choices help but do not eliminate the risk.
The Bayesian community has invested heavily in reproducibility tooling. ArviZ, a Python library by Ravin Kumar and colleagues released in 2019, provides standard diagnostics, plots, and model comparison utilities that work across PyMC, Stan, NumPyro, Pyro, and TensorFlow Probability. Stan's Bayesian workflow, articulated in a 2020 paper by Andrew Gelman and twelve coauthors, codifies a sequence of model-building, prior predictive checking, fitting, diagnostics, posterior predictive checking, and model comparison steps that the community recommends as best practice.
Gaussian process libraries including GPy, GPflow, GPyTorch, and BoTorch implement scalable inference for Gaussian process models. GPyTorch's exploitation of structured kernel matrices and Lanczos methods has pushed Gaussian process inference from datasets of thousands to millions of points.
Mixture models and clustering rely on Bayesian inference in many flavors, from finite Gaussian mixture model inference via expectation-maximization with Bayesian regularization, to nonparametric Dirichlet process mixtures that learn the number of components from data. The scikit-learn implementation of Bayesian Gaussian mixtures is a common entry point for Python practitioners.
The canonical modern textbook is Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin's "Bayesian Data Analysis," third edition, published by Chapman and Hall in 2013, which combines theory, computation, and case studies. Christopher Bishop's "Pattern Recognition and Machine Learning," published by Springer in 2006, devotes its tenth chapter to approximate inference and ties Bayesian ideas to machine learning. David MacKay's "Information Theory, Inference, and Learning Algorithms," published by Cambridge University Press in 2003 and freely available online, develops Bayesian inference alongside information theory and includes some of the clearest expositions of Monte Carlo methods.
Kevin Murphy's "Probabilistic Machine Learning: An Introduction" (2022) and "Probabilistic Machine Learning: Advanced Topics" (2023), both from MIT Press, give comprehensive modern treatments. Edwin Jaynes' posthumous "Probability Theory: The Logic of Science" (2003) presents an opinionated logical view of probability. Judea Pearl's "Probabilistic Reasoning in Intelligent Systems" (1988) and "Causality" (2000) bridge Bayesian inference and causal reasoning. Sharon Bertsch McGrayne's popular history "The Theory That Would Not Die" (2011) traces the social history of Bayes' rule from cryptography to medicine.
For modern Bayesian deep learning, the surveys by Yarin Gal ("Uncertainty in Deep Learning," PhD thesis, University of Cambridge, 2016) and the more recent reviews by Andrew Gordon Wilson, Eric Nalisnick, and others provide good entry points. The Stan Reference Manual and the PyMC documentation are excellent practical references for applied work.