Markov Chain Monte Carlo (MCMC) is a class of algorithms for drawing samples from a probability distribution by constructing a Markov chain whose stationary distribution equals the target distribution [1]. After the chain has been simulated long enough to forget its starting point, the visited states behave as approximate samples from the target, and any expectation under the target can be estimated as a sample average over the chain. MCMC is the workhorse of modern computational statistics because it sidesteps the need to compute normalizing constants or perform analytic integration, both of which can be intractable in moderately complex models. The core idea is to design a chain whose transition kernel preserves the target distribution, then run the chain and harvest its visits.
MCMC sits at the intersection of probability, statistics, and physics. The original algorithm was conceived in 1953 at Los Alamos to compute equation-of-state properties of dense fluids on the MANIAC computer [2]. It crossed into mainstream statistics through the Hastings generalization of 1970 [3] and exploded in popularity after Geman and Geman introduced Gibbs sampling for image restoration in 1984 [4] and Gelfand and Smith showed in 1990 how to apply Gibbs sampling to a wide range of Bayesian inference problems. Today MCMC underlies probabilistic programming systems such as Stan, PyMC, and NumPyro, drives Bayesian estimation in econometrics and astronomy, and connects deeply to score-based generative models in machine learning through Langevin dynamics [5].
Many scientific problems reduce to estimating an expectation of a function f under a probability distribution pi on some space X. When pi is a multivariate Gaussian or a low-dimensional standard distribution, the expectation can often be computed analytically or by numerical quadrature. In realistic settings, however, pi is known only up to a normalizing constant. A Bayesian posterior, for example, is proportional to the likelihood times the prior, but the proportionality constant is the marginal likelihood of the data, an integral that is itself usually intractable. Even when pi is tractable, computing expectations by quadrature suffers from the curse of dimensionality: the number of grid points required for a fixed accuracy grows exponentially with the number of variables, making naive numerical integration hopeless beyond a handful of dimensions.
Monte Carlo methods replace deterministic integration with random sampling. If one can draw independent samples from pi, the law of large numbers guarantees that the sample average converges to the expectation, and the central limit theorem gives the standard error as a function of sample size. The challenge in modern problems is that drawing independent samples from pi is itself usually impossible. Rejection sampling and importance sampling work in low dimensions but degrade rapidly as the dimension grows because most of the probability mass lives in narrow ridges that random proposals never find efficiently.
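As a concrete illustration of the plain Monte Carlo estimator described above, the following sketch estimates E[X^2] under a standard normal using only the Python standard library; the function name `mc_estimate` is illustrative, not a library API. By the law of large numbers the sample average converges to the true value (here 1), and the central limit theorem gives the standard error, which shrinks like 1/sqrt(n):

```python
import math
import random

random.seed(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo: sample average of f and its CLT standard error."""
    draws = [f(sampler()) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, math.sqrt(var / n)  # estimate, standard error

# E[X^2] = 1 for X ~ N(0, 1); with n = 100,000 the standard error is small
est, se = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 100_000)
```

This works only because `random.gauss` draws independent samples directly; the rest of the article is about what to do when no such direct sampler exists.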
MCMC solves this problem by constructing a Markov chain that explores the space according to pi. Rather than producing independent samples, the chain produces correlated samples whose long-run frequency matches pi. The price for correlated samples is a slower effective sample rate; a chain of length N might be worth far fewer than N independent samples. The benefit is that the chain only needs to evaluate the unnormalized density at each visited state, sidestepping the normalizing constant entirely. This single property makes MCMC the dominant method for posterior inference in any model whose marginal likelihood is hard to compute.
A Markov chain on a state space X is a sequence of random variables X_0, X_1, X_2, ... such that the conditional distribution of X_{t+1} given the entire history depends only on X_t. The transition kernel K specifies the conditional distribution of X_{t+1} given X_t. A distribution pi is called stationary for K if drawing X_t from pi and then sampling X_{t+1} from K leaves the marginal distribution of X_{t+1} also equal to pi. If the chain is irreducible (it can reach any region of positive pi-mass from any starting point) and aperiodic (it does not return to states only at regular intervals), then the marginal distribution of X_t converges to pi regardless of the starting state. The chain forgets its initialization at a geometric rate under standard regularity conditions.
Most MCMC algorithms achieve stationarity by enforcing detailed balance. A transition kernel K satisfies detailed balance with respect to pi if pi(x) K(x, y) equals pi(y) K(y, x) for all states x and y. Detailed balance is a stronger condition than stationarity (it implies stationarity but not conversely), and it has the practical advantage that it can be checked locally rather than globally. The Metropolis-Hastings algorithm, Gibbs sampling, slice sampling, and Hamiltonian Monte Carlo all rely on detailed balance or its slightly weaker generalizations.
For a chain that has converged, ergodic averages converge to true expectations under pi by the ergodic theorem, an analog of the law of large numbers for Markov chains. The asymptotic variance of an ergodic average depends on the autocorrelation of the chain. Strongly correlated chains require longer simulations to reach a given Monte Carlo precision, which is why mixing time and effective sample size are central concerns in practical MCMC.
The Metropolis algorithm was published in 1953 by Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller in the Journal of Chemical Physics under the title "Equation of State Calculations by Fast Computing Machines" [2]. The original problem was to compute pressure and other thermodynamic properties of two-dimensional fluids of hard disks on the MANIAC I computer at what was then the Los Alamos Scientific Laboratory. Direct simulation of the canonical ensemble required integrating over a high-dimensional configuration space weighted by the Boltzmann distribution exp(-E/kT), and the team needed a way to draw samples from this distribution without computing the partition function.
Marshall and Arianna Rosenbluth designed the acceptance rule that bears the Metropolis name. A trial move displaces a single particle by a small random amount and accepts the move with probability one if the new energy is lower, or with probability exp(-Delta E / kT) if the new energy is higher. The chain converges to the canonical ensemble, and time averages of any observable estimate its thermodynamic expectation. The 1953 paper is remarkable both for the algorithm itself and for being among the first published examples of stochastic computer simulation. The MANIAC computer had three thousand vacuum tubes and a few kilobytes of memory, yet the team computed several thousand Markov chain steps for systems of 224 particles, enough to extract sensible pressure-density curves.
The algorithm spent its first two decades almost entirely within statistical physics. It was used to study Lennard-Jones fluids, to estimate critical exponents in lattice spin models, and to compute properties of polymer chains. The statistics community was largely unaware of it. Edward Teller later remarked that the algorithm's translation from physics to statistics took longer than its discovery.
In 1970 the Canadian statistician W. K. Hastings published a paper in Biometrika titled "Monte Carlo Sampling Methods Using Markov Chains and Their Applications" [3]. Hastings generalized the Metropolis acceptance rule to allow asymmetric proposal distributions. The Metropolis algorithm assumed that the probability of proposing a move from x to y equals the probability of proposing the reverse move; Hastings replaced this with a more flexible acceptance ratio that includes the proposal density in both directions. The generalization, now universally called the Metropolis-Hastings algorithm, opened the door to a much wider range of proposal distributions and allowed practitioners to design proposals that match the local geometry of the target distribution.
Hastings also discussed component-wise updates in which one variable is updated at a time conditional on the others, foreshadowing the Gibbs sampler that Geman and Geman would name fourteen years later. Despite the importance of the paper in retrospect, it received little attention at the time. Hastings published only a handful of follow-up papers on the topic, and the wider statistics community did not begin to exploit the framework until the late 1980s.
The modern era of MCMC began with two papers in the 1980s. Stuart Geman and Donald Geman, working on Bayesian image restoration at Brown University, introduced Gibbs sampling in 1984 in IEEE Transactions on Pattern Analysis and Machine Intelligence [4]. Their paper named the algorithm after the Gibbs distribution from statistical mechanics and applied it to Markov random field models for noisy images. Although the Gemans were unaware of earlier work by Hastings and others, their algorithm is a special case of Metropolis-Hastings in which each proposal is the full conditional distribution of one component given the others, and the acceptance probability is always one.
The second pivotal paper was Tanner and Wong's 1987 paper in JASA on data augmentation, which used iterative simulation to handle missing data and latent variables in Bayesian models. Together with Gelfand and Smith's 1990 paper in JASA, which showed how to use Gibbs sampling for general posterior inference, these contributions ignited a rapid expansion of MCMC across the statistics literature. By the mid-1990s nearly every issue of major statistics journals contained at least one MCMC paper, and the BUGS software (Bayesian inference Using Gibbs Sampling) had made the method accessible to applied users.
Hamiltonian Monte Carlo originated in the lattice quantum chromodynamics community. Duane, Kennedy, Pendleton, and Roweth published the algorithm in Physics Letters B in 1987 under the name Hybrid Monte Carlo, designed to accelerate Markov chain simulation of lattice gauge theories [6]. The method augments the target with auxiliary momentum variables and uses Hamiltonian dynamics to propose long-range moves, leveraging gradient information to escape the random-walk behavior that plagues simple Metropolis methods. Radford Neal introduced HMC to the statistics community in a 1993 technical report and, more decisively, in his influential chapter "MCMC Using Hamiltonian Dynamics" in the 2011 Handbook of Markov Chain Monte Carlo [7]. Neal's chapter is the standard pedagogical reference for HMC and remains the entry point for most students of the method.
The No-U-Turn Sampler (NUTS) by Matthew Hoffman and Andrew Gelman, published in the Journal of Machine Learning Research in 2014, removed the last major obstacle to widespread HMC adoption [8]. HMC requires the user to specify a step size and a number of leapfrog steps per proposal, and these tuning parameters strongly influence efficiency. NUTS automatically adapts both. It builds a binary tree of leapfrog steps in both forward and backward time directions and stops as soon as the trajectory begins to double back on itself, eliminating the need to specify a path length. Combined with dual averaging for step-size adaptation, NUTS allows HMC to be used as a black-box sampler with essentially no manual tuning. NUTS is the default sampler in Stan and is also used in PyMC and NumPyro.
Reversible jump MCMC, introduced by Peter Green in 1995, generalized Metropolis-Hastings to spaces of varying dimension, allowing chains to move between models with different numbers of parameters. Reversible jump MCMC enabled fully Bayesian model selection in settings such as mixture models with an unknown number of components, regression with variable selection, and nonparametric tree models.
Metropolis-Hastings is the foundational MCMC algorithm and the parent of nearly every other method in the family. Given a target density pi (known up to a constant) and a proposal density q(y | x), the algorithm proceeds by drawing a candidate y from q(y | X_t), computing the acceptance probability alpha = min(1, [pi(y) q(X_t | y)] / [pi(X_t) q(y | X_t)]), and setting X_{t+1} equal to y with probability alpha or X_t otherwise. The acceptance ratio depends only on ratios of pi, so the normalizing constant cancels.
The choice of proposal density determines the chain's efficiency. A symmetric Gaussian proposal centered at the current state is the simplest option and recovers the original Metropolis algorithm. The proposal scale should be tuned so that the chain accepts roughly 23.4 percent of moves in high-dimensional Gaussian-like targets, the optimal-scaling result derived by Roberts, Gelman, and Gilks in 1997. Smaller proposals lead to higher acceptance but slower exploration, while larger proposals explore more aggressively but are rejected more often. The autocorrelation time depends sensitively on proposal scale, and adaptive Metropolis algorithms tune the scale automatically based on the empirical covariance of past samples.
Metropolis-Hastings is robust and easy to implement but suffers from random-walk behavior in high dimensions. The chain takes on the order of d^2 steps to traverse a d-dimensional Gaussian target, making it impractical for problems with hundreds or thousands of parameters unless combined with a smarter proposal mechanism such as Hamiltonian dynamics or component-wise updating.
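The accept/reject recipe above can be sketched in a few lines. The following is a minimal random-walk Metropolis sampler on a one-dimensional standard normal target, using only the unnormalized log density (the normalizing constant cancels in the ratio, so it is never needed); the helper names are illustrative, not a library API:

```python
import math
import random

random.seed(1)

def log_target(x):
    # Unnormalized log density of N(0, 1); the constant cancels in the ratio
    return -0.5 * x * x

def metropolis(log_pi, x0, scale, n_steps):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        y = x + random.gauss(0.0, scale)              # symmetric proposal
        # Accept with probability min(1, pi(y)/pi(x)), computed in log space
        if math.log(random.random()) < log_pi(y) - log_pi(x):
            x, accepted = y, accepted + 1
        samples.append(x)                             # rejections repeat x
    return samples, accepted / n_steps

samples, acc_rate = metropolis(log_target, x0=3.0, scale=2.4, n_steps=50_000)
```

Note that a rejected proposal still contributes a sample (the chain stays put); silently dropping rejections is a common implementation bug that biases the stationary distribution.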
Gibbs sampling updates one variable (or block of variables) at a time, drawing each from its conditional distribution given the current values of all the others [4]. For a target pi(x_1, ..., x_d), a sweep consists of drawing x_1 from pi(x_1 | x_2, ..., x_d), then x_2 from pi(x_2 | x_1, x_3, ..., x_d), and so on through x_d. Each conditional draw is itself a Metropolis-Hastings step with acceptance probability one, so Gibbs sampling can be viewed as a special case of Metropolis-Hastings.
Gibbs sampling is most useful when the conditional distributions have closed-form expressions. Conjugate Bayesian models such as Gaussian-Gaussian regression, Beta-Binomial models, Dirichlet-Multinomial mixtures, and many hierarchical models have tractable conditionals, which is why Gibbs sampling rose to prominence in Bayesian statistics during the 1990s. The BUGS software automated the construction of Gibbs samplers from a graphical model specification, dramatically lowering the barrier to entry.
Gibbs sampling can also be combined with Metropolis-Hastings updates for components without closed-form conditionals, a hybrid sometimes called Metropolis-within-Gibbs. The chain remains valid as long as each update preserves the joint target. The main weakness of Gibbs sampling is that highly correlated variables produce a chain that mixes slowly, since each component is updated along a coordinate axis even when the target's principal axes are not aligned with the coordinates. Block updates and reparameterizations can mitigate the problem.
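A sweep of exact conditional draws is easiest to see on a standard bivariate Gaussian with correlation rho, where each full conditional is itself Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2. A minimal sketch (names are illustrative):

```python
import math
import random

random.seed(2)

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho.

    Each full conditional is Gaussian, so every update is an exact draw
    and the Metropolis-Hastings acceptance probability is one.
    """
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2, samples = 0.0, 0.0, []
    for _ in range(n_steps):
        x1 = random.gauss(rho * x2, sd)  # draw x1 from pi(x1 | x2)
        x2 = random.gauss(rho * x1, sd)  # draw x2 given the NEW x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_steps=50_000)
```

The example also exhibits the weakness noted above: as rho approaches one, each coordinate update moves the chain only a short distance along the target's long diagonal axis, and mixing slows dramatically.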
Slice sampling, introduced by Radford Neal in a 2003 Annals of Statistics paper, provides a black-box alternative to Metropolis-Hastings that avoids tuning a proposal scale. The basic idea is to introduce an auxiliary uniform variable u between zero and pi(x), then alternate between drawing u given x and drawing x uniformly from the slice {y : pi(y) > u}. The chain is provably correct and adapts naturally to the local scale of the target. The main implementation challenge is identifying the slice, which Neal solves with a stepping-out procedure that grows an interval until both endpoints lie outside the slice, followed by a shrinkage procedure that contracts the interval until a point inside the slice is sampled.
Slice sampling is particularly attractive for univariate components and for use within Gibbs sampling, where it can replace Metropolis-Hastings updates that would otherwise require careful proposal tuning. It is also the default sampler in many software packages for variables that lack tractable conditional forms.
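A simplified sketch of Neal's stepping-out and shrinkage procedures for a univariate target follows; it works in log space (drawing the auxiliary height u uniformly under pi(x) is equivalent to adding log of a uniform to log pi(x)) and adds a step budget as a safety measure not in the original paper:

```python
import math
import random

random.seed(3)

def slice_sample(log_pi, x, w=1.0, max_steps=50):
    """One univariate slice-sampling update (stepping out, then shrinkage)."""
    # Auxiliary height: u ~ Uniform(0, pi(x)), stored as log u
    log_u = log_pi(x) + math.log(random.random())
    # Stepping out: grow an interval of width w until both ends exit the slice
    left = x - w * random.random()
    right = left + w
    while max_steps and log_pi(left) > log_u:
        left -= w
        max_steps -= 1
    while max_steps and log_pi(right) > log_u:
        right += w
        max_steps -= 1
    # Shrinkage: sample uniformly, contracting the interval on each rejection
    while True:
        x_new = random.uniform(left, right)
        if log_pi(x_new) > log_u:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new

x, draws = 0.0, []
for _ in range(20_000):
    x = slice_sample(lambda t: -0.5 * t * t, x)  # target: N(0, 1)
    draws.append(x)
```

Note that no proposal scale needed tuning: the width w only affects efficiency, not correctness, which is the property that makes slice sampling attractive as a drop-in component inside Gibbs sweeps.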
Hamiltonian Monte Carlo (HMC) augments the target distribution with auxiliary momentum variables p drawn from a multivariate Gaussian, then proposes new states by simulating Hamiltonian dynamics on the joint position-momentum space [7]. The Hamiltonian H(x, p) equals the negative log of the target plus a kinetic energy term (half the inner product of the momentum with itself, possibly preconditioned by a mass matrix). Hamiltonian dynamics conserves H exactly in continuous time and approximately under numerical integration, so a long deterministic simulation produces a far-away proposal that nonetheless has nearly the same density as the starting point.
HMC uses the leapfrog integrator, a symplectic scheme that alternates half-step momentum updates with full-step position updates. The leapfrog integrator is volume-preserving and time-reversible, two properties that allow the Metropolis-Hastings acceptance ratio to be computed exactly even though the proposal is generated by a complicated dynamical simulation. After running a fixed number of leapfrog steps, the proposal is accepted with probability min(1, exp(H_old - H_new)). Energy conservation under leapfrog is excellent in practice, so acceptance rates of 60 to 90 percent are typical, and the chain can take very long jumps in a single step.
The central advantage of HMC over random-walk Metropolis is that it suppresses the random-walk behavior that limits high-dimensional sampling. Under optimal scaling, the cost of random-walk Metropolis grows on the order of d^2 with dimension, while HMC's cost grows only on the order of d^{5/4} (its optimal step size shrinks as d^{-1/4}), a polynomial but dramatic improvement. The cost is that HMC requires gradients of the log density, which adds a per-step computational burden but is feasible for any differentiable model.
The No-U-Turn Sampler, introduced by Hoffman and Gelman in 2014, is the most widely used HMC variant in modern probabilistic programming [8]. The key innovation is a recursive doubling algorithm that builds a binary tree of leapfrog states forward and backward in fictitious time, doubling the path length at each iteration. The expansion stops as soon as the trajectory begins to retrace itself, detected by a U-turn condition that compares the displacement vector with the momentum at each tree boundary. Once the tree is complete, NUTS samples uniformly from the visited states subject to a slice-sampling constraint that ensures detailed balance.
NUTS also adapts the leapfrog step size during a warmup phase using a primal-dual averaging scheme that targets a user-specified average acceptance probability (typically 0.8). After warmup the step size is fixed and sampling proceeds without further adaptation. The combination of automatic tree-length adaptation and step-size adaptation makes NUTS essentially a black-box sampler. The user supplies a log-density function and its gradient, and NUTS handles the rest.
NUTS is the default backend in Stan, the algorithm of choice in PyMC's modern API, and the foundation of NumPyro's sampling stack. Its empirical efficiency, robustness, and absence of tunable knobs have made it the de facto standard for general-purpose Bayesian inference on continuous parameter spaces.
Reversible jump MCMC, due to Peter Green in 1995, extends Metropolis-Hastings to settings where the dimension of the parameter space is itself a random variable. A typical application is fitting a mixture model whose number of components is unknown. Each model in the family has a different parameter dimension, and the chain must move between dimensions in a way that preserves the joint target on the union of model spaces. The algorithm achieves this by introducing dimension-matching auxiliary variables and using a Jacobian factor in the acceptance ratio.
Reversible jump MCMC is more difficult to implement than fixed-dimension methods and requires careful design of trans-dimensional proposals. Software support is limited compared to Stan-style fixed-dimension samplers, and many practitioners use marginal likelihood estimation or variational methods for model selection rather than reversible jump MCMC. The algorithm remains the rigorous gold standard when fully Bayesian model averaging or model selection is required.
Parallel tempering, also known as replica exchange MCMC, runs multiple chains in parallel at different temperatures, where the t-th chain samples from a tempered version of the target proportional to pi(x)^{1/T_t}. Higher temperatures flatten the distribution, allowing the corresponding chain to traverse barriers between modes more easily. Periodically, neighboring chains attempt to swap states using a Metropolis-Hastings acceptance probability that preserves their respective stationary distributions. Information flows from the high-temperature chains down through the temperature ladder to the cold chain, which targets the original distribution.
Parallel tempering is the standard technique for sampling from multimodal distributions where vanilla MCMC would become trapped in a single mode. It is widely used in molecular dynamics, protein folding, and Bayesian phylogenetics, and has seen renewed interest in machine learning for sampling from energy-based models and posterior distributions of neural networks.
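The ladder-and-swap scheme can be sketched on a one-dimensional bimodal target, an equal mixture of N(-4, 1) and N(4, 1), whose modes a single random-walk chain rarely crosses. The temperature ladder and proposal scale below are arbitrary illustrative choices; the swap acceptance follows from the ratio of tempered densities pi^{1/T}:

```python
import math
import random

random.seed(5)

def log_target(x):
    # Bimodal target: equal mixture of N(-4, 1) and N(4, 1), unnormalized
    return math.log(math.exp(-0.5 * (x - 4) ** 2) +
                    math.exp(-0.5 * (x + 4) ** 2))

temps = [1.0, 4.0, 16.0]      # temperature ladder; chain 0 is the cold chain
xs = [0.0] * len(temps)
cold_draws = []
for step in range(30_000):
    # Within-temperature random-walk Metropolis move for each replica
    for i, T in enumerate(temps):
        y = xs[i] + random.gauss(0.0, 1.0)
        if math.log(random.random()) < (log_target(y) - log_target(xs[i])) / T:
            xs[i] = y
    # Attempt a state swap between one random pair of neighboring replicas
    i = random.randrange(len(temps) - 1)
    log_ratio = (log_target(xs[i + 1]) - log_target(xs[i])) * \
                (1.0 / temps[i] - 1.0 / temps[i + 1])
    if math.log(random.random()) < log_ratio:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    cold_draws.append(xs[0])
```

The hottest replica sees an almost flat landscape and wanders freely between the modes; swaps then carry those mode changes down the ladder, so the cold chain visits both modes even though its own local moves almost never cross the barrier.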
The following table summarizes the main MCMC algorithms and their relative strengths.
| Algorithm | Year | Uses gradients | Tuning needed | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Metropolis | 1953 | No | Proposal scale | Simple, robust, general | Random-walk behavior in high d |
| Metropolis-Hastings | 1970 | No | Proposal density | Flexible proposals, asymmetric jumps | Same scaling limits as Metropolis |
| Gibbs sampling | 1984 | No | Block structure | Closed-form when conditionals known | Slow with strong correlations |
| Slice sampling | 2003 | No | Almost none | Adapts to local scale | Sometimes slow per step |
| Hamiltonian Monte Carlo | 1987 | Yes | Step size, path length | Excellent high-d scaling | Needs differentiable target |
| NUTS | 2014 | Yes | None (after warmup) | Black-box HMC | Hard to extend to constrained spaces |
| Reversible jump | 1995 | Optional | Trans-dimensional proposals | Handles variable model dimension | Difficult to implement |
| Parallel tempering | 1991 | Optional | Temperature ladder | Crosses barriers between modes | Many chains needed |
| Particle MCMC | 2010 | No | Particle count | State-space models with latent paths | Computationally expensive |
MCMC samples are useful only after the chain has converged to its stationary distribution. Diagnosing convergence is a fundamentally hard problem because the chain's only output is the sample path itself, which gives limited information about the parts of the space that have not been visited. Practical MCMC therefore relies on a battery of diagnostics that detect common failure modes rather than guaranteeing convergence.
A trace plot displays the value of a single parameter (or a derived quantity such as the log posterior) against iteration index. A converged, well-mixing chain produces a stationary, hairy time series with no trends, no large excursions, and no obvious dependence on the starting value. Multiple chains run from different starting points should overlap and become indistinguishable after the warmup phase. Trace plots that drift, that show long sojourns in distinct regions, or that fail to overlap across chains are signs of poor mixing or unconverged chains. Modern recommendations sometimes replace trace plots with rank plots, which compare the empirical rank distribution of each chain across all chains and are more sensitive to subtle failures.
The R-hat statistic, introduced by Andrew Gelman and Donald Rubin in 1992, compares the variance within each chain to the variance between chains. It is computed by running m parallel chains of length n from overdispersed starting points, then forming W (the average within-chain variance) and B (the variance of the chain means scaled by n). The marginal posterior variance is estimated as a weighted average, and R-hat is the square root of the ratio of this estimate to W. If the chains have converged to the same stationary distribution, R-hat should be close to one. A common threshold is R-hat below 1.01 (the modern stricter cutoff popularized by Vehtari and colleagues) or 1.1 (the original looser cutoff). Values larger than the threshold indicate that the chains have not yet mixed.
The original R-hat has been refined repeatedly. Brooks and Gelman (1998) proposed a multivariate version. Vehtari et al. (2021) introduced a rank-normalized R-hat that is more reliable for chains with heavy tails or non-Gaussian distributions, and they recommend computing R-hat separately for the bulk and tails of the distribution. Many modern packages report this improved R-hat by default.
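The classic Gelman-Rubin construction (not the rank-normalized refinement) reduces to a few lines. The following sketch follows the within/between variance recipe described above; the function name is illustrative:

```python
import math
import random

random.seed(6)

def r_hat(chains):
    """Classic Gelman-Rubin R-hat from m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # B: between-chain variance (variance of chain means, scaled by n)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # W: average within-chain variance
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n   # weighted posterior-variance estimate
    return math.sqrt(var_hat / w)

# Four chains sampling the same distribution: R-hat should be near 1
good = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# One chain stuck in a different region: R-hat should be well above 1
bad = good[:3] + [[random.gauss(5, 1) for _ in range(1000)]]
```

In practice one should use a library implementation (e.g. the rank-normalized R-hat reported by modern packages), which also splits each chain in half to catch within-chain drift that this simple version misses.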
The effective sample size (ESS) measures how many independent samples the correlated chain is worth. It is defined as the chain length divided by the integrated autocorrelation time tau = 1 + 2(rho_1 + rho_2 + ...), where rho_k is the chain's autocorrelation at lag k. A chain with N iterations and integrated autocorrelation time tau yields an effective sample size of N/tau. Strong autocorrelations make ESS much smaller than N. A typical recommendation is to require an effective sample size of at least a few hundred for any quantity of interest before declaring inference reliable. Vehtari et al. (2021) introduced a more reliable ESS estimator that splits the chain into halves to detect non-stationarity and computes separate ESS values for the bulk and tails.
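A simple ESS estimator truncates the autocorrelation sum once the correlations become negligible; the cutoff rule below is a crude stand-in for the principled truncation (Geyer's initial-sequence pairing) used by production estimators. As a check, an AR(1) chain with coefficient phi has a known integrated autocorrelation time tau = (1 + phi) / (1 - phi):

```python
import random

random.seed(7)

def effective_sample_size(chain):
    """ESS = N / tau, with tau = 1 + 2 * (truncated sum of autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    tau, lag = 1.0, 1
    while lag < n // 2:
        acf = sum((chain[t] - mean) * (chain[t + lag] - mean)
                  for t in range(n - lag)) / (n * var)
        if acf < 0.05:          # crude truncation once correlations die out
            break
        tau += 2.0 * acf
        lag += 1
    return n / tau

# AR(1) chain with phi = 0.9: tau = 1.9 / 0.1 = 19, so ESS is roughly N / 19
phi, x, chain = 0.9, 0.0, []
for _ in range(20_000):
    x = phi * x + random.gauss(0.0, 1.0)
    chain.append(x)
```

For this chain the estimator returns on the order of a thousand effective samples out of twenty thousand iterations, which is exactly the penalty for correlation that the text describes.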
Monte Carlo standard errors compare the precision of the MCMC estimate to the desired precision. If the standard error is small relative to the posterior standard deviation, the chain has produced enough samples for the practical purpose at hand. The Heidelberger-Welch test detects non-stationarity in a single chain. The Geweke test compares the means of the early and late portions of a chain to detect drift. None of these methods can prove convergence; they can only detect failures.
MCMC is the standard method for posterior inference in Bayesian statistics, with applications in essentially every quantitative discipline. The following table illustrates the breadth of MCMC use across domains.
| Domain | Typical use case | Common method |
|---|---|---|
| Bayesian regression | Hierarchical regression, generalized linear models | NUTS, Gibbs sampling |
| Phylogenetics | Inferring evolutionary trees from sequence data | Metropolis-coupled MCMC, parallel tempering |
| Astronomy | Fitting cosmological models, exoplanet orbit estimation | Affine-invariant ensemble samplers, HMC |
| Econometrics | Stochastic volatility, dynamic factor models | Particle MCMC, Gibbs sampling |
| Epidemiology | Compartmental disease models, contact tracing | Particle MCMC, Metropolis-Hastings |
| Bayesian deep learning | Posterior over neural network weights | Stochastic gradient HMC, Langevin dynamics |
| Generative modeling | Sampling from energy-based models, diffusion models | Langevin dynamics, parallel tempering |
| Mixture models | Inference in Gaussian mixture models | Gibbs sampling, reversible jump |
| Image processing | Markov random fields, image restoration | Gibbs sampling, simulated annealing |
| Topic modeling | Latent Dirichlet allocation, hierarchical Dirichlet processes | Collapsed Gibbs sampling |
The canonical application of MCMC is Bayesian posterior inference. Given a prior p(theta) and a likelihood p(data | theta), Bayes's theorem states that the posterior is proportional to the product of the two. The marginal likelihood in the denominator is usually intractable, but MCMC sidesteps it because the chain only needs the unnormalized posterior. Hierarchical models, in which parameters at one level are drawn from distributions parameterized at a higher level, are particularly well suited to Gibbs sampling because the conditional distributions often have closed forms. Stan, PyMC, and NumPyro have made hierarchical Bayesian modeling routine in fields ranging from political science to ecology.
Bayesian deep learning treats the weights of a neural network as random variables and aims to infer the posterior distribution over weights given training data. The posterior captures epistemic uncertainty about the model and supports principled prediction intervals. Posterior inference for deep learning models is challenging because the parameter space has hundreds of millions of dimensions and the posterior is highly non-Gaussian.
Full-batch HMC has been shown by Izmailov and colleagues (2021) to give the gold-standard posterior estimate for moderately sized Bayesian neural networks (BNNs), but it requires running through the full dataset for every leapfrog step. Stochastic gradient HMC, stochastic gradient Langevin dynamics, and related methods scale to larger datasets by replacing exact gradients with mini-batch estimates and adding a noise term to maintain the correct stationary distribution. The cost is that the noise injected by the mini-batch estimate biases the stationary distribution slightly, and recovering the exact posterior requires careful step-size scheduling. In practice, stochastic gradient MCMC is often combined with cold posterior tempering or with variational approximations to balance scalability and accuracy.
Symmetric splitting HMC (Cobb and Jalaian, 2021) has enabled exact full-batch HMC for neural networks with millions of parameters by exploiting the structure of the gradient computation. Combined with modern accelerator hardware, these methods are bringing full Bayesian inference within reach for production-scale neural networks.
Langevin dynamics, the continuous-time stochastic process whose stationary distribution is a target density pi, is the bridge between MCMC and modern score-based generative modeling [5]. The discretized Langevin update steps in the direction of the score (the gradient of the log density) and adds Gaussian noise calibrated to the step size. In the small-step-size limit, the discretization converges to a sample from pi.
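The discretized update is one line: step half the squared step size along the score, add Gaussian noise scaled by the step size. The sketch below is the unadjusted Langevin algorithm (no Metropolis correction, so the stationary distribution carries a small discretization bias that vanishes as the step size shrinks); names are illustrative:

```python
import math
import random

random.seed(8)

def langevin_step(grad_log_pi, x, step_size):
    """Unadjusted Langevin update: drift along the score plus matched noise."""
    noise = random.gauss(0.0, 1.0)
    return x + 0.5 * step_size * grad_log_pi(x) + math.sqrt(step_size) * noise

# Target N(0, 1), whose score is grad log pi(x) = -x; small steps give
# approximate samples even starting far from the mass at x = 5
x, draws = 5.0, []
for _ in range(50_000):
    x = langevin_step(lambda t: -t, x, step_size=0.05)
    draws.append(x)
```

Adding a Metropolis accept/reject step to this update yields the Metropolis-adjusted Langevin algorithm (MALA), which removes the discretization bias; score-based generative models instead keep the unadjusted form because the learned score makes the acceptance ratio unavailable.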
Score-based generative models, introduced by Yang Song and Stefano Ermon in 2019 and unified with diffusion models in subsequent work, train a neural network to estimate the score of a noise-perturbed data distribution at multiple noise scales. Sampling proceeds by annealed Langevin dynamics: starting from pure noise, the model runs Langevin updates at progressively smaller noise scales, with each scale using a different score estimate. The resulting algorithm is essentially MCMC sampling from a sequence of distributions that interpolate between an easy-to-sample base distribution and the data distribution. This perspective unifies diffusion models, score matching, and MCMC, and recent work continues to exploit MCMC ideas to improve sampler efficiency in generative models.
The critically-damped Langevin diffusion approach by NVIDIA augments the data variable with auxiliary momentum, mirroring the construction in HMC and yielding faster mixing for high-dimensional generative models. The exchange of ideas between the MCMC and generative modeling communities continues to be an active research frontier.
Particle filters, also known as sequential Monte Carlo (SMC) methods, are not strictly MCMC algorithms but are closely related and often combined with MCMC. They estimate the posterior distribution over latent states in nonlinear non-Gaussian state-space models by maintaining a population of weighted particles that evolve through time according to the state dynamics and are reweighted by their likelihood under each new observation. Particle MCMC, introduced by Andrieu, Doucet, and Holenstein in 2010, embeds a particle filter inside a Metropolis-Hastings or Gibbs sampler to perform fully Bayesian inference for parameters of state-space models with latent state trajectories. The combined algorithm has become a standard tool in epidemiology, ecology, and finance.
The following table compares the major probabilistic programming and MCMC libraries.
| Software | Language | Default sampler | Strengths | Notes |
|---|---|---|---|---|
| Stan | Stan DSL, R, Python | NUTS | Industry standard for HMC, mature ecosystem | Curly-brace block syntax |
| PyMC | Python | NUTS (with JAX backend optional) | Pythonic API, broad model library | Successor to PyMC3; rebuilt on PyTensor |
| NumPyro | Python (JAX) | NUTS | JAX-based, fast on GPU | Lightweight, JAX-native |
| Pyro | Python (PyTorch) | NUTS, HMC | Deep universal probabilistic programming | Designed for variational inference too |
| TensorFlow Probability | Python (TF) | HMC, NUTS | Integrates with TensorFlow ecosystem | Lower-level than Stan or PyMC |
| Turing.jl | Julia | NUTS, HMC, particle MCMC | Composable, Julia-native | Strong performance for hierarchical models |
| BUGS / WinBUGS | BUGS DSL | Gibbs | Original Bayesian inference engine | Largely superseded by Stan and PyMC |
| JAGS | BUGS-like DSL | Gibbs, slice sampling | Successor to BUGS | Used in R via rjags |
| emcee | Python | Affine-invariant ensemble | Popular in astronomy | Gradient-free |
| BlackJAX | Python (JAX) | NUTS, HMC, SMC, others | Modular sampler library | Used as backend by other PPLs |
Stan, developed by Andrew Gelman, Bob Carpenter, and colleagues at Columbia, was released in 2012 and remains the industry standard for HMC-based Bayesian inference [9]. Models are written in a domain-specific language with curly-brace blocks for data, parameters, transformed parameters, and the model, then compiled to C++. The default sampler is dynamic NUTS with several modern improvements, including multinomial sampling of trajectories and improved metric adaptation.
PyMC began life in 2009 as a pure-Python Metropolis-Hastings library and evolved through PyMC3 (which used Theano for autodiff) into the modern PyMC built on the PyTensor backend [10]. The default sampler is NUTS, with the option to use external samplers from BlackJAX or NumPyro for faster execution on GPUs. PyMC's API is designed to feel like idiomatic Python, with random variables declared inside a model context manager and probabilistic programs expressed as ordinary Python code.
NumPyro, developed by the Pyro team at Uber AI Labs, uses JAX as its computational backend and offers one of the fastest NUTS implementations available, particularly on GPU and TPU hardware. Benchmarks on Bayesian item response models and other hierarchical structures have shown NumPyro outperforming PyStan and PyMC by factors of two to ten, depending on the problem and hardware.
For scientists working with state-space models, particle MCMC implementations are available in libraries such as particles (Python) and pomp (R). For machine learning researchers exploring stochastic gradient MCMC, libraries such as bayes-torch and the Bayesian inference modules in TensorFlow Probability provide implementations of stochastic gradient Langevin dynamics, stochastic gradient HMC, and related methods.
MCMC chains are usually started from arbitrary initial values that are not draws from the target distribution. The early portion of the chain is therefore not representative of the target and is typically discarded as burn-in (in the Metropolis tradition) or warmup (in the Stan tradition). Stan and other modern HMC implementations use the warmup period to adapt the step size, the mass matrix, and other tuning parameters, after which the chain runs without further adaptation. Choosing the burn-in length is a judgment call. Trace plots, R-hat, and ESS diagnostics provide guidance, but no single rule fits all problems. A common default in HMC is 1000 warmup iterations followed by 1000 sampling iterations, but harder problems may require ten or one hundred times more.
Thinning means discarding all but every k-th sample from the chain to reduce storage and autocorrelation. While intuitively appealing, thinning is statistically inefficient unless storage is the binding constraint, since it discards information that could be used in the ergodic average. Most modern guidance recommends keeping all samples and reporting effective sample size rather than relying on thinning to produce nominally independent draws.
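The point is easy to check numerically. The sketch below uses an AR(1) chain as a stand-in for MCMC output and a simple Geyer-style initial-positive-sequence estimator of effective sample size (both choices illustrative); the thinned chain's ESS never exceeds the full chain's:

```python
import numpy as np

def ess(chain, max_lag=1000):
    """Effective sample size via Geyer's initial positive sequence:
    sum paired autocorrelations (rho_1+rho_2), (rho_3+rho_4), ...
    until a pair goes negative."""
    n = len(chain)
    c = chain - chain.mean()
    denom = np.dot(c, c)
    tau, k = 1.0, 1
    while k + 1 <= max_lag:
        rho1 = np.dot(c[:-k], c[k:]) / denom
        rho2 = np.dot(c[:-(k + 1)], c[k + 1:]) / denom
        if rho1 + rho2 < 0:
            break
        tau += 2 * (rho1 + rho2)
        k += 2
    return n / tau

rng = np.random.default_rng(3)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):          # AR(1) with lag-1 correlation 0.9
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

full_ess = ess(x)
thinned_ess = ess(x[::10])          # keep every 10th draw
print(full_ess, thinned_ess)        # thinning does not increase the ESS
```

For this chain the true integrated autocorrelation time is (1 + 0.9)/(1 - 0.9) = 19, so the full chain carries roughly 50,000/19 effective draws; the thinned chain has far less autocorrelation per retained draw but fewer draws, and comes out behind.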
Most MCMC algorithms are local in the sense that proposals concentrate near the current state. When the target has multiple well-separated modes, the chain may become trapped in one mode for a long time, producing a sample that misrepresents the global distribution. Parallel tempering, simulated annealing, and population-based methods are designed to address multimodality. In Bayesian deep learning, multimodality is the rule rather than the exception, and recent work explores ways to combine local MCMC with cycle-based methods or with explicit mode-jumping proposals.
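To make the tempering idea concrete, here is a minimal sketch (target, temperature ladder, and step sizes all illustrative): each chain runs random-walk Metropolis on a tempered target pi(x)^beta_k, and adjacent temperatures occasionally exchange states, letting mode jumps made by the hot chains propagate down to the cold one:

```python
import numpy as np

def logp(x):
    # equal mixture of N(-4, 1) and N(4, 1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 4) ** 2, -0.5 * (x - 4) ** 2)

rng = np.random.default_rng(6)
betas = np.array([1.0, 0.5, 0.25, 0.1])     # beta = 1 is the target chain
x = np.full(len(betas), -4.0)               # start every chain in one mode
cold = []
for t in range(20_000):
    # within-chain random-walk Metropolis on each tempered target
    prop = x + rng.standard_normal(len(betas))
    accept = np.log(rng.random(len(betas))) < betas * (logp(prop) - logp(x))
    x = np.where(accept, prop, x)
    # propose a swap between a random adjacent pair of temperatures
    k = rng.integers(len(betas) - 1)
    log_ratio = (betas[k] - betas[k + 1]) * (logp(x[k + 1]) - logp(x[k]))
    if np.log(rng.random()) < log_ratio:
        x[k], x[k + 1] = x[k + 1], x[k]
    cold.append(x[0])
cold = np.asarray(cold)
print((cold > 0).mean())    # both modes visited despite starting in one
```

A single chain at beta = 1 started at -4 would almost never cross the low-density barrier at zero; here the beta = 0.1 chain crosses freely and swaps carry the discovery down the ladder.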
The efficiency of MCMC algorithms can depend dramatically on the parameterization of the model. Centered parameterizations of hierarchical models, where group-level parameters are drawn directly from population distributions, often produce funnel-shaped posteriors with extreme curvature that slow down NUTS. Non-centered parameterizations, where group-level parameters are written as a population mean plus a scaled standard normal deviate, often sample much more efficiently. Reparameterization is a routine part of model debugging in modern Bayesian workflows and is documented extensively in the Stan user guide.
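The funnel geometry and its cure can be illustrated with Neal's funnel, a standard test case: v ~ N(0, 3^2) and x_i | v ~ N(0, e^v). The sketch below (a plain random-walk Metropolis sampler with illustrative step sizes, standing in for NUTS) compares the centered parameterization with the non-centered one, under which the target becomes a product of independent normals:

```python
import numpy as np

def rwm(logp, x0, step, n_steps, rng):
    """Random-walk Metropolis with an isotropic Gaussian proposal."""
    x = np.array(x0, dtype=float)
    lp = logp(x)
    out = np.empty((n_steps, x.size))
    for t in range(n_steps):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = logp(prop)
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out[t] = x
    return out

def logp_centered(q):
    # q = (v, x_1..x_9): v ~ N(0, 9), x_i | v ~ N(0, e^v)
    v, x = q[0], q[1:]
    return -v**2 / 18.0 - 0.5 * np.exp(-v) * np.sum(x**2) - 4.5 * v

def logp_noncentered(q):
    # q = (v, z_1..z_9) with x = exp(v/2) * z: a product of normals
    v, z = q[0], q[1:]
    return -v**2 / 18.0 - 0.5 * np.sum(z**2)

rng = np.random.default_rng(5)
cen = rwm(logp_centered, np.zeros(10), 0.4, 50_000, rng)
non = rwm(logp_noncentered, np.zeros(10), 0.4, 50_000, rng)
# how far each chain descends into the neck of the funnel (very negative v)
print(cen[:, 0].min(), non[:, 0].min())
```

In the centered coordinates the sampler must shrink the x's and v together to enter the funnel's neck, so it rarely reaches very negative v; in the non-centered coordinates the neck disappears entirely.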
MCMC is a remarkably general framework but has well-known limitations. Sampling is fundamentally serial within a chain: producing N samples requires N sequential evaluations of the target density, and although independent chains parallelize trivially, each chain must still converge on its own. This typically makes MCMC slower than variational inference for problems with many parameters or large datasets. Even with HMC's improved high-dimensional scaling, posterior inference for state-of-the-art deep learning models remains computationally prohibitive without approximations.
Multimodality remains a persistent challenge. While parallel tempering and ensemble samplers help in moderate-dimensional problems, no general-purpose method efficiently handles posteriors with many widely separated modes in very high dimensions. Combining MCMC with optimization or variational methods to identify modes, then using local MCMC to characterize each mode, is a common compromise.
The theoretical foundations of MCMC for non-reversible chains, for chains with adaptive proposals, and for stochastic gradient variants continue to be active research areas. Non-reversible MCMC, which violates detailed balance but preserves stationarity, can sometimes mix faster than reversible methods. Adaptive MCMC, which tunes proposal distributions based on past samples, can match the efficiency of well-tuned chains but requires careful design to maintain ergodicity. Stochastic gradient MCMC for machine learning introduces bias that depends on step size, batch size, and gradient noise, and characterizing this bias remains an open problem.
The interplay between MCMC and modern AI is still being worked out. As diffusion models increasingly dominate generative modeling and as Bayesian methods make inroads into deep learning, MCMC ideas continue to find new applications even as the techniques themselves continue to evolve. The 2011 Handbook of Markov Chain Monte Carlo, edited by Brooks, Gelman, Jones, and Meng, remains the most comprehensive reference for the field [1], and Robert and Casella's Monte Carlo Statistical Methods is the standard graduate-level text [11].