Importance sampling
Last reviewed
May 1, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 · 3,713 words
Importance sampling (often abbreviated IS) is a Monte Carlo variance-reduction technique for estimating expectations under one probability distribution by drawing samples from a different proposal distribution and reweighting each sample by a likelihood ratio. The technique is one of the oldest tools in computational statistics, with roots in 1940s rare-event simulation work at Los Alamos and a textbook treatment in Hammersley and Handscomb's 1964 monograph Monte Carlo Methods. It now sits at the centre of off-policy reinforcement learning, variational inference, particle filtering, simulation-based Bayesian inference, and counterfactual evaluation in industrial recommender systems.
Given a target density $p$, a proposal density $q$ with $q(x)>0$ wherever $p(x)f(x)\neq 0$, and a function $f$ whose expectation we want, the basic identity is
$$\mathbb{E}_p[f(X)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_q\!\left[f(X)\,w(X)\right],$$
where the importance weight $w(x)=p(x)/q(x)$ corrects for the mismatch between sampling distribution and target. The standard estimator is
$$\hat\mu_{\text{IS}} = \frac{1}{N}\sum_{i=1}^N f(x_i)\,w(x_i), \qquad x_i \sim q.$$
When $q$ puts positive density everywhere that $f(x)p(x)$ is non-zero, the estimator is unbiased; in practice the choice of $q$ controls almost everything that can go right or wrong with the method.
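As a minimal sketch of the estimator above, the following NumPy snippet estimates a rare-event probability, $P(X>4)$ for $X\sim\mathcal N(0,1)$, by sampling from a proposal shifted onto the rare region. The function name `tail_prob_is`, the choice $q=\mathcal N(4,1)$, the seed, and the sample size are illustrative assumptions, not part of any cited source.

```python
import math
import numpy as np

def tail_prob_is(threshold, n, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) using the proposal q = N(threshold, 1)."""
    x = rng.normal(loc=threshold, scale=1.0, size=n)   # x_i ~ q
    # log w = log p - log q; the Gaussian normalising constants cancel (equal variances)
    log_w = -0.5 * x**2 + 0.5 * (x - threshold) ** 2
    f = (x > threshold).astype(float)                  # indicator of the rare event
    return float(np.mean(f * np.exp(log_w)))

rng = np.random.default_rng(0)
estimate = tail_prob_is(4.0, 100_000, rng)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))          # closed-form Gaussian tail probability
print(f"IS estimate {estimate:.3e}, exact {exact:.3e}")
```

With the same budget, plain Monte Carlo from $p$ would expect only about three draws above the threshold, which is why shifting the proposal helps so much here.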
Three forces keep importance sampling at the centre of modern probabilistic computing:
Different fields rediscovered the same equation under different names. Statisticians call $w$ the importance weight. Causal inference calls it the inverse propensity score. Survey methodologists call it a sampling weight. Reinforcement learning calls $\rho_t = \pi(a|s)/\mu(a|s)$ the importance ratio. The arithmetic is identical.
When $q$ dominates $|f|p$ (that is, $q(x)>0$ wherever $f(x)p(x)\neq 0$), $\hat\mu_{\text{IS}}$ is unbiased for $\mu = \mathbb{E}_p[f(X)]$. Its variance is $\operatorname{Var}_q[f(X)\,w(X)]/N$, where the per-sample variance is
$$\operatorname{Var}_q\!\left[f(X)\,w(X)\right] = \mathbb{E}_q\!\left[f(X)^2 w(X)^2\right] - \mu^2 = \int \frac{f(x)^2 p(x)^2}{q(x)}\,dx - \mu^2.$$
The second moment can be infinite even when $\mathbb{E}_p[f]$ is finite. Owen's Monte Carlo theory, methods and examples (Chapter 9) gives precise conditions: a sufficient condition for finite variance is that $f$ has a finite second moment under $p$ and $w(x)$ is bounded, i.e. $\sup_x p(x)/q(x) < \infty$. When the second moment is infinite the estimator is still unbiased, but the classical central limit theorem does not apply and confidence intervals built from the sample variance are meaningless.
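A worked Gaussian case (an illustration chosen here, not an example from the cited texts) shows how sharply the second moment depends on the proposal's tails. Take $p=\mathcal N(0,1)$, $q=\mathcal N(0,\sigma^2)$ and $f\equiv 1$, so that the second moment of the estimator's summand is $\mathbb{E}_q[w(X)^2]$:

$$\mathbb{E}_q\!\left[w(X)^2\right] = \int \frac{p(x)^2}{q(x)}\,dx = \frac{\sigma}{\sqrt{2\pi}}\int \exp\!\left(-x^2\left(1-\frac{1}{2\sigma^2}\right)\right)dx = \frac{\sigma^2}{\sqrt{2\sigma^2-1}} \qquad \text{for } \sigma^2 > \tfrac12,$$

and the integral diverges for $\sigma^2\le\tfrac12$. The weight $w(x)=\sigma\exp\!\left(-\tfrac{x^2}{2}\left(1-\sigma^{-2}\right)\right)$ is bounded only when $\sigma^2\ge 1$, so for $\tfrac12<\sigma^2<1$ the weights are unbounded yet the variance is still finite: boundedness is sufficient, not necessary. At $\sigma=1$ the formula gives $1$, as it should when $q=p$.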
Minimising the second moment over $q$ subject to $\int q = 1$ gives the optimal proposal
$$q^*(x) \propto |f(x)|\,p(x).$$
For non-negative $f$ this $q^*$ achieves zero variance; a single sample suffices. The catch is that $q^*$'s normalising constant is exactly $\mathbb{E}_p[|f(X)|]$, the quantity we wanted in the first place. The optimal proposal is therefore unattainable, but it is a useful target: a good $q$ should look roughly like $|f|p$, putting mass where $f$ is large and $p$ is non-negligible.
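The zero-variance claim follows in one line. Assuming $f\ge 0$ and writing $\mu=\mathbb{E}_p[f(X)]$ for the normalising constant, the normalised optimal proposal is $q^*(x)=f(x)\,p(x)/\mu$, so every draw with $f(x)>0$ returns the same value:

$$f(x)\,w(x) = f(x)\,\frac{p(x)}{q^*(x)} = f(x)\,p(x)\,\frac{\mu}{f(x)\,p(x)} = \mu.$$

A summand that is constant under $q^*$ has zero variance, but computing $q^*$ requires already knowing $\mu$.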
The most common diagnostic for an importance sample is the effective sample size (ESS):
$$n_{\text{eff}} = \frac{\left(\sum_{i=1}^N w_i\right)^2}{\sum_{i=1}^N w_i^2}.$$
Its value lies between $1$ (one weight dominates) and $N$ (uniform weights). An ESS of $350$ from $N=1000$ samples means that the weighted estimator carries about as much information as $350$ direct draws from $p$. ESS is widely used in sequential Monte Carlo to decide when to resample particles. It is a necessary diagnostic but not a sufficient one: a high ESS can still hide catastrophic tail behaviour if $q$ misses an important mode entirely.
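The ESS formula can be computed directly from log-weights, which is how weights are usually stored to avoid overflow. The sketch below is illustrative; the helper name `effective_sample_size` is an assumption of this example.

```python
import numpy as np

def effective_sample_size(log_w):
    """Effective sample size (sum w)^2 / sum w^2 from possibly unnormalised log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # rescaling all weights leaves the ratio unchanged
    return float(w.sum() ** 2 / np.sum(w ** 2))

ess_uniform = effective_sample_size(np.zeros(1000))           # all weights equal
ess_degenerate = effective_sample_size([0.0, -50.0, -50.0])   # one weight dominates
print(ess_uniform, ess_degenerate)
```

The two calls illustrate the extremes described above: equal weights give an ESS of $N$, and a single dominant weight gives an ESS close to $1$.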
When $p$ is known only up to a normalising constant (the usual situation in Bayesian inference where $p(x)\propto \tilde p(x)$), the unnormalised weights $\tilde w(x)=\tilde p(x)/q(x)$ are still computable but $\hat\mu_{\text{IS}}$ is not. The standard fix is the self-normalised estimator (SNIS):
$$\bar w_i = \frac{\tilde w(x_i)}{\sum_{j=1}^N \tilde w(x_j)} = \frac{\tilde p(x_i)/q(x_i)}{\sum_{j=1}^N \tilde p(x_j)/q(x_j)}, \qquad \hat\mu_{\text{SNIS}} = \sum_{i=1}^N \bar w_i\,f(x_i).$$
SNIS is biased (it is a ratio of two random quantities) but consistent, and it often has lower mean-squared error than the vanilla estimator because fluctuations in the numerator and denominator partly cancel. The bias scales as $O(1/N)$ while the standard error scales as $O(1/\sqrt{N})$, so the bias becomes negligible quickly. Robert and Casella's Monte Carlo Statistical Methods (Chapter 3) treats SNIS as the default importance sampler in Bayesian practice. Recent work such as Cardoso et al.'s 2022 BR-SNIS uses iterated sampling-importance-resampling to reduce the bias further at essentially the same cost.
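A minimal sketch of the self-normalised estimator, assuming a toy target known only up to a constant (an unnormalised $\mathcal N(1,1)$) and a wider Gaussian proposal. The function name `snis_estimate` and all numerical settings are illustrative choices.

```python
import numpy as np

def snis_estimate(f, log_p_tilde, log_q, x):
    """Self-normalised IS estimate of E_p[f(X)] from draws x ~ q."""
    log_w = log_p_tilde(x) - log_q(x)          # unnormalised log-weights
    w = np.exp(log_w - np.max(log_w))          # subtract the max for numerical stability
    w_bar = w / np.sum(w)                      # normalised weights sum to one
    return float(np.sum(w_bar * f(x)))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=50_000)          # proposal q = N(0, 2^2)
log_p_tilde = lambda x: -0.5 * (x - 1.0) ** 2  # target N(1, 1) without its constant
log_q = lambda x: -0.5 * (x / 2.0) ** 2        # proposal constant omitted too: it cancels
snis_mean = snis_estimate(lambda x: x, log_p_tilde, log_q, x)
print(f"SNIS estimate of E_p[X]: {snis_mean:.3f} (true value 1)")
```

Because the weights are renormalised, any constant added to either log-density drops out, which is exactly why the estimator works for unnormalised targets.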
Vanilla importance sampling is only one of a family of variance reduction tools, and it is often combined with the others.
| Technique | Idea | Typical use |
|---|---|---|
| Antithetic variables | Pair each $x_i$ with $-x_i$ (or another negatively correlated draw) | Symmetric integrands, simulation studies |
| Control variates | Subtract a known-mean function correlated with $f$ | Pricing financial derivatives, RL baselines |
| Stratified sampling | Partition the domain and sample within each stratum | Survey statistics, low-discrepancy QMC |
| Multiple importance sampling (Veach 1995) | Combine samples from several proposals via the balance heuristic | Path-traced rendering, bidirectional path tracing |
| Adaptive importance sampling | Refine $q$ from past samples (population Monte Carlo, AMIS) | Bayesian model evidence, simulation-based inference |
| Annealed importance sampling (Neal 2001) | Move along a tempered sequence $p_0,\ldots,p_n$ via MCMC kernels and accumulate weights | Marginal likelihood estimation, hard posteriors |
Multiple importance sampling deserves a closer look. Eric Veach and Leonidas Guibas introduced it at SIGGRAPH 1995 in Optimally Combining Sampling Techniques for Monte Carlo Rendering, and it remains the workhorse of production renderers. Their balance heuristic weights each sample by the density of the proposal that produced it, scaled by that proposal's sample count, relative to the corresponding sum over all proposals; they proved that no other combination strategy can have much lower variance.
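The balance heuristic can be sketched for a one-dimensional integral. With $n_i$ samples from proposal $q_i$, the weight is $n_i q_i(x)/\sum_k n_k q_k(x)$, so each sample's contribution simplifies to $f(x)/\sum_k n_k q_k(x)$. The integrand, the two Gaussian proposals, and the helper names below are assumptions made for illustration only.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mis_balance(f, proposals, counts, rng):
    """Multi-sample MIS estimate of the integral of f using the balance heuristic.

    proposals: list of (mean, std) Gaussian proposals; counts: samples drawn from each.
    """
    total = 0.0
    for (mean, std), n in zip(proposals, counts):
        x = rng.normal(mean, std, size=n)
        # Balance heuristic: w_i(x) f(x) / (n_i q_i(x)) = f(x) / sum_k n_k q_k(x)
        denom = sum(n_k * normal_pdf(x, m_k, s_k)
                    for (m_k, s_k), n_k in zip(proposals, counts))
        total += np.sum(f(x) / denom)
    return float(total)

# Two narrow bumps; each proposal covers one of them. The exact integral is 2.
f = lambda x: normal_pdf(x, -2.0, 0.5) + normal_pdf(x, 2.0, 0.5)
rng = np.random.default_rng(2)
mis_estimate = mis_balance(f, [(-2.0, 1.0), (2.0, 1.0)], [20_000, 20_000], rng)
print(f"MIS estimate {mis_estimate:.3f} (exact 2)")
```

Either proposal used alone would rarely sample the other bump and would produce occasional huge weights there; dividing by the combined density keeps every contribution bounded.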
Importance sampling is the standard correction in off-policy reinforcement learning, where a behaviour policy $\mu$ generates trajectories and a different target policy $\pi$ is evaluated or improved. The setting is unavoidable in three places: replay buffers in deep RL, observational data in healthcare and recommendation systems, and parallel actor-learner architectures where the actors run a stale policy by the time the learner uses their data.
For a single full trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)$ generated by $\mu$, the trajectory importance ratio is
$$\rho_{0:T} = \prod_{k=0}^{T-1} \frac{\pi(a_k\mid s_k)}{\mu(a_k\mid s_k)}.$$
An unbiased estimator of the value $V_\pi(s)$ of the start state $s=s_0$ in an episodic Markov decision process is
$$\hat V_\pi(s) = \frac{1}{N}\sum_{i=1}^N \rho^{(i)}_{0:T}\,G^{(i)},$$
where $G^{(i)}$ is the discounted return of the $i$-th trajectory. Sutton and Barto's Reinforcement Learning: An Introduction (Chapter 5) develops this ordinary importance sampling estimator alongside its weighted counterpart and works out their bias/variance trade-off. The trouble is the product: a long trajectory multiplies many ratios, the variance of $\rho_{0:T}$ can grow exponentially in $T$, and a single lucky episode can swing the average by orders of magnitude. Sutton and Barto make a related point about off-policy Monte Carlo control with a greedy target policy: the method "learns only from the tails of episodes," because the ratio is zero for any stretch of an episode that contains a non-greedy action.
Doina Precup, Richard Sutton and Satinder Singh introduced per-decision importance sampling in their ICML 2000 paper Eligibility Traces for Off-Policy Policy Evaluation. The trick is that the reward at time $t$ only depends on the actions up to time $t$, so its expectation needs only the partial product $\rho_{0:t}$, not the full trajectory ratio. This shrinks the variance considerably and underlies many later off-policy return estimators. The same paper analyses five eligibility-trace algorithms that combine importance sampling with TD learning and establishes their consistency and bias properties.
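The difference between the full-trajectory and per-decision estimators can be sketched on a toy problem. The setup below is hypothetical (ten steps, two actions, a reward of 1 whenever action 1 is taken, a uniform behaviour policy), chosen so that the true value under the target policy is $0.8\times 10 = 8$. In the arrays, index $t$ holds the ratio for action $a_t$ and the reward that follows it.

```python
import numpy as np

def ordinary_is(ratios, rewards, gamma=1.0):
    """Weight the whole discounted return by the full-trajectory ratio."""
    discounts = gamma ** np.arange(rewards.shape[-1])
    return np.prod(ratios, axis=-1) * np.sum(discounts * rewards, axis=-1)

def per_decision_is(ratios, rewards, gamma=1.0):
    """Weight each reward only by the ratios of the actions taken up to that step."""
    discounts = gamma ** np.arange(rewards.shape[-1])
    partial = np.cumprod(ratios, axis=-1)          # partial products of per-step ratios
    return np.sum(discounts * partial * rewards, axis=-1)

# Hypothetical toy problem: behaviour policy is uniform over two actions.
rng = np.random.default_rng(3)
n_episodes, horizon = 20_000, 10
actions = rng.integers(0, 2, size=(n_episodes, horizon))
pi = np.array([0.2, 0.8])                          # target policy probabilities
ratios = pi[actions] / 0.5                         # pi(a) / mu(a) for each step
rewards = actions.astype(float)                    # reward 1 for action 1, else 0

ois = ordinary_is(ratios, rewards)
pdis = per_decision_is(ratios, rewards)
print(f"ordinary:     mean {ois.mean():.2f}, variance {ois.var():.1f}")
print(f"per-decision: mean {pdis.mean():.2f}, variance {pdis.var():.1f}")
```

Both estimators target the same value, but the per-decision version avoids multiplying early rewards by ratios of later actions, which is where most of the variance reduction comes from.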
More recent work tames the variance by truncating the ratios at the cost of some bias.
Tree backup. Replaces the importance ratio with the target-policy probability $\pi(a_t\mid s_t)$, which removes the variance contributed by the behaviour policy but cuts traces quickly, making it inefficient when the two policies are close.
Retrace($\lambda$) (Munos et al., NeurIPS 2016). Uses the truncated ratio $\bar\rho_t = \lambda\min(1, \pi(a_t|s_t)/\mu(a_t|s_t))$ in a return-based off-policy update. Munos and colleagues proved that Retrace converges almost surely to $Q^*$ without a Greedy-in-the-Limit-with-Infinite-Exploration assumption, which they believed to be the first such result for a return-based off-policy control algorithm. As a corollary they established the convergence of Watkins's $Q(\lambda)$, a long-standing open question.
V-trace (Espeholt et al., IMPALA, ICML 2018). Applies two clipping constants $\bar\rho$ and $\bar c$ to the per-step ratios so that a distributed actor-learner architecture can absorb the policy lag from thousands of asynchronous workers. IMPALA achieves about 250,000 frames per second on a single learner with hundreds of actors, and V-trace is widely used for large-scale training in which the data are only slightly off-policy.
PPO clipping (Schulman et al. 2017). Proximal policy optimization's flagship objective is
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat A_t,\;\operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\right)\right],$$
where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$ is a one-step importance ratio. The clip constant (typically $\epsilon=0.2$) prevents the ratio from blowing up when the new policy strays from the old, which would otherwise destabilise gradient updates. PPO is now the default policy-gradient method in most deep-RL libraries.
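The truncated and clipped ratios above reduce to a few array operations. The sketch below evaluates the Retrace trace coefficient and the PPO clipped surrogate on fixed arrays; it is a plain NumPy illustration with assumed function names, not the code of any RL library, and it computes the objective value only (no gradients).

```python
import numpy as np

def retrace_coefficient(log_pi, log_mu, lam=1.0):
    """Truncated ratio lam * min(1, pi/mu), computed from log-probabilities."""
    return lam * np.minimum(1.0, np.exp(log_pi - log_mu))

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(log_prob_new - log_prob_old)            # one-step importance ratio r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

log_new = np.log(np.array([0.6, 0.2, 0.6]))    # ratios 1.5, 0.5, 1.5
log_old = np.log(np.array([0.4, 0.4, 0.4]))
advantages = np.array([1.0, -1.0, -1.0])
objective = ppo_clip_objective(log_new, log_old, advantages)
coefficients = retrace_coefficient(log_new, log_old, lam=0.9)
print(objective, coefficients)
```

Taking the minimum makes the clipping one-sided: it removes the incentive to push the ratio further once it already favours the update (first two samples), but leaves the penalty intact when the ratio has moved the wrong way (third sample).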
Kingma and Welling's 2013 Variational Autoencoder maximises the evidence lower bound (ELBO) on $\log p(x)$ using a single sample from the recognition network $q_\phi(z\mid x)$. The ELBO can be tightened with importance weighting.
Burda, Grosse and Salakhutdinov's Importance Weighted Autoencoders (ICLR 2016) introduced the IWAE bound
$$\mathcal{L}_K^{\text{IWAE}}(x) = \mathbb{E}_{z_1,\ldots,z_K\sim q_\phi(\cdot\mid x)}\!\left[\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k\mid x)}\right].$$
By Jensen's inequality this is a lower bound on $\log p_\theta(x)$ for every $K\ge 1$, with $\mathcal{L}_1^{\text{IWAE}} = \text{ELBO}$. The bound is non-decreasing in $K$ and, provided the importance weights are bounded, converges to $\log p_\theta(x)$ as $K\to\infty$. The trick is the position of the logarithm: the VAE averages the logs of the importance-weighted likelihood ratios, whereas the IWAE takes the log of their average. IWAE-style $K$-sample bounds are now a standard way to estimate the marginal likelihood of deep latent-variable generative models.
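The tightening with $K$ can be checked numerically on a model whose marginal likelihood is known. The sketch below assumes a toy linear-Gaussian model ($z\sim\mathcal N(0,1)$, $x\mid z\sim\mathcal N(z,1)$, hence $p(x)=\mathcal N(0,2)$) and a deliberately too-wide proposal $q(z\mid x)=\mathcal N(x/2,1)$. No neural networks are involved and all names are illustrative.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def logmeanexp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def iwae_bound(x, k, n_repeats, rng):
    """Monte Carlo estimate of the K-sample bound, averaged over n_repeats replicates."""
    z = rng.normal(x / 2.0, 1.0, size=(n_repeats, k))      # z_k ~ q(z | x) = N(x/2, 1)
    log_w = (log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)   # log p(x, z)
             - log_normal(z, x / 2.0, 1.0))                    # minus log q(z | x)
    return float(np.mean(logmeanexp(log_w, axis=1)))

rng = np.random.default_rng(0)
x_obs = 1.0
bounds = {k: iwae_bound(x_obs, k, 20_000, rng) for k in (1, 10, 100)}
log_px = float(log_normal(x_obs, 0.0, 2.0))                # exact log marginal likelihood
print(bounds, log_px)
```

With $K=1$ the estimate is the ordinary ELBO; larger $K$ moves it towards the exact value, up to Monte Carlo noise in the outer average.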
Closely related techniques include Reweighted Wake-Sleep (Bornschein and Bengio, ICLR 2015), which uses importance-weighted wake-phase updates to train discrete generative models, and Annealed Importance Sampling (Neal 2001), which constructs a chain of intermediate distributions $p_0,\ldots,p_n$ between a tractable proposal and the target and accumulates importance weights along an MCMC trajectory. AIS is the standard estimator of marginal likelihoods for restricted Boltzmann machines, deep belief networks, and recently for evaluating diffusion-model log-likelihoods.
In state-space models the target is a sequence of posteriors $p(x_{0:t}\mid y_{1:t})$ that grows in dimension over time. Sequential importance sampling (SIS) extends an importance sample one time step at a time, multiplying weights by $w_t \propto p(y_t\mid x_t)\,p(x_t\mid x_{t-1})/q(x_t\mid x_{t-1}, y_t)$.
SIS suffers from weight degeneracy: after a few time steps almost all probability mass concentrates on a single particle. The fix is resampling: at each step (or whenever the ESS falls below a threshold), draw a new particle population proportional to current weights. This Sampling-Importance-Resampling (SIR) algorithm is what Gordon, Salmond and Smith introduced in their 1993 IEE paper Novel approach to nonlinear/non-Gaussian Bayesian state estimation, popularising the bootstrap filter. The bootstrap filter and its descendants underpin robot localisation (FastSLAM), object tracking, target tracking in radar, epidemic forecasting, and probabilistic programming languages such as Anglican and Pyro.
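A bootstrap filter fits in a few lines once the proposal is taken to be the transition density, because the incremental weight then reduces to the observation likelihood. The sketch below assumes a hypothetical linear-Gaussian model ($x_t = 0.9\,x_{t-1} + \text{noise}$, $y_t = x_t + \text{noise}$, unit variances), multinomial resampling, and an ESS threshold of $N/2$; these are illustrative choices rather than the settings of any cited paper.

```python
import numpy as np

def bootstrap_filter(ys, n_particles, rng, a=0.9, sigma_x=1.0, sigma_y=1.0):
    """Bootstrap particle filter for x_t = a x_{t-1} + noise, y_t = x_t + noise."""
    particles = rng.normal(0.0, sigma_x, size=n_particles)    # assumed prior on x_0
    log_w = np.zeros(n_particles)
    filtered_means = np.empty(len(ys))
    for t, y in enumerate(ys):
        # Proposal = transition density, so the incremental weight is p(y_t | x_t)
        particles = a * particles + rng.normal(0.0, sigma_x, size=n_particles)
        log_w += -0.5 * ((y - particles) / sigma_y) ** 2
        log_w -= log_w.max()                                  # numerical stability
        w = np.exp(log_w)
        w /= w.sum()
        filtered_means[t] = np.sum(w * particles)
        if 1.0 / np.sum(w ** 2) < 0.5 * n_particles:          # ESS below N/2: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = particles[idx]
            log_w = np.zeros(n_particles)                     # weights reset to uniform
    return filtered_means

# Simulate synthetic data from the same model
rng = np.random.default_rng(4)
n_steps = 500
xs = np.empty(n_steps)
x = rng.normal(0.0, 1.0)
for t in range(n_steps):
    x = 0.9 * x + rng.normal(0.0, 1.0)
    xs[t] = x
ys = xs + rng.normal(0.0, 1.0, size=n_steps)

means = bootstrap_filter(ys, 1000, rng)
rmse_filter = np.sqrt(np.mean((means - xs) ** 2))
rmse_obs = np.sqrt(np.mean((ys - xs) ** 2))
print(f"RMSE of filtered mean {rmse_filter:.3f}, RMSE of raw observations {rmse_obs:.3f}")
```

Resampling only when the ESS drops, rather than at every step, avoids adding resampling noise when the weights are still reasonably even.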
| Application area | How importance sampling is used | Representative reference |
|---|---|---|
| Off-policy value estimation | Reweight returns by trajectory or per-decision ratios | Precup, Sutton and Singh 2000 |
| Distributed deep RL | V-trace clipping in actor-learner architectures | Espeholt et al. (IMPALA) 2018 |
| Policy optimisation | PPO clipped surrogate, TRPO surrogate | Schulman et al. 2017 |
| Variational inference | IWAE tighter ELBO, Reweighted Wake-Sleep | Burda et al. 2016 |
| Marginal likelihood | Annealed importance sampling along tempered chain | Neal 2001 |
| Bayesian model comparison | PSIS-LOO leave-one-out cross-validation | Vehtari, Gelman, Gabry 2017 |
| Particle filters | SIR/bootstrap filter for state-space models | Gordon, Salmond and Smith 1993 |
| Computer graphics rendering | Multiple importance sampling for path tracing | Veach and Guibas 1995 |
| Counterfactual policy evaluation | Inverse propensity scoring for ad/recommendation systems | Bottou et al. 2013 |
| Causal inference | Inverse probability of treatment weighting | Horvitz and Thompson 1952 |
| Survey statistics | Sampling weights to correct stratified surveys | Hansen and Hurwitz 1943 |
| Likelihood-free inference (ABC) | Reweight simulations by approximate likelihood | Sisson et al. 2007 |
| Diffusion model evaluation | IWAE-style upper bound on negative log-likelihood | Song et al. 2021 |
| Neural rendering and inverse graphics | Multiple importance sampling within differentiable renderers | Müller et al. 2019 |
Importance sampling is unforgiving when the proposal is wrong. The classical failure modes are:
Three diagnostics are standard practice:
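Pareto-smoothed importance sampling and its Pareto $\hat k$ diagnostic are implemented in the R package loo and the Python package arviz.
Importance sampling is built into most probabilistic programming languages and reinforcement learning frameworks: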
| System | Where importance sampling appears |
|---|---|
| PyMC | SMC sampler, variational SVGD, PSIS-LOO |
| Pyro / NumPyro | infer.SMCFilter, infer.MCMC, IWAE example |
| TensorFlow Probability | tfp.mcmc, tfp.experimental.distribute, importance-weighted ELBO |
| Edward2 | Importance-weighted training and evaluation |
| Stan + loo | PSIS leave-one-out cross-validation |
| Stable-Baselines3 | PPO clipped objective |
| RLlib | IMPALA V-trace, PPO clipping |
| Tianshou | PPO and off-policy actor-critic algorithms with IS corrections |
| Open-source renderers (Mitsuba, PBRT) | Multiple importance sampling in path tracers |
Research interest in importance sampling has stayed steady because the technique sits at the intersection of so many fields. A few directions stand out in the past decade.
Pareto-smoothed importance sampling. Vehtari et al.'s PSIS gave the field a practical, sample-size-aware diagnostic and is now the default for Bayesian leave-one-out cross-validation.
Differentiable Monte Carlo. Treating importance-weighted estimators as differentiable graphs (with the reparameterisation trick where possible) lets practitioners back-propagate through Monte Carlo objectives. This is the technical core of the IWAE bound, of differentiable rendering, and of recent work on differentiable annealed importance sampling.
Off-policy evaluation in offline reinforcement learning. The shift from online RL to offline RL on logged data made off-policy evaluation a first-class problem. Self-normalised importance sampling, doubly robust estimators, and clipped IPS are routine in the offline-RL literature.
LLM evaluation and inference-time scaling. Importance sampling has become a core tool for evaluating large language models on rare prompts, for re-weighting samples from proposal LLMs to estimate behaviour under a target policy (used in distillation and red-teaming), and for re-scoring beam search candidates with auxiliary models.
Neural proposals. Normalising-flow and energy-based models are now used as learned importance-sampling proposals in physics simulation and lattice quantum chromodynamics, where a hand-designed $q$ would be hopeless.
| Method | Assumptions on target | When to use | Cost per sample | Output |
|---|---|---|---|---|
| Vanilla Monte Carlo | Can sample directly from $p$ | $p$ is easy to draw from and integrand is well-behaved | Very low | i.i.d. samples from $p$, unbiased $\hat\mu$ |
| Importance sampling | Can evaluate $p$ (up to constant) and have a covering proposal $q$ | Sampling from $p$ is hard, or $f$ concentrates in a rare region | Low | Weighted samples, unbiased (or consistent) $\hat\mu$ |
| Markov Chain Monte Carlo (Metropolis–Hastings, Gibbs) | Can evaluate $\tilde p$ up to constant | Posterior sampling, unimodal or moderately multimodal targets | Moderate, correlated samples | Approximate samples after burn-in |
| Hamiltonian Monte Carlo | $\nabla\log\tilde p$ available | Continuous high-dimensional posteriors | Higher per step, much better mixing | Approximate samples with low autocorrelation |
| Sequential Monte Carlo / particle filters | State-space structure or sequence of intermediate targets | Nonlinear, non-Gaussian filtering, evidence estimation | Moderate, parallel | Particle approximation of $p_t$, evidence estimate |
| Variational inference | Choose a tractable family $q_\phi$ | Speed-critical Bayesian inference, large-scale latent-variable models | Optimisation cost | Parametric approximation $q_\phi$ |
| Annealed importance sampling | Tempered chain $p_0,\ldots,p_n$ | Marginal likelihood / evidence estimation | Higher (chain of MCMC steps) | Unbiased weight, evidence estimate |
While precursors of importance sampling appear in 1949 statistical-physics work by Kahn and Marshall (and in Hansen and Hurwitz's 1943 stratified survey weights), the explicit formulation as a variance-reduction technique is generally credited to the early Los Alamos Monte Carlo group. Hammersley and Handscomb's 1964 Monte Carlo Methods (Methuen Monographs on Applied Statistics and Probability) gave the first textbook treatment, and is still cited as the canonical reference for the classical results. Teun Kloek and Herman van Dijk's 1978 Econometrica paper introduced importance sampling to Bayesian econometrics. Rubinstein's 1981 Simulation and the Monte Carlo Method and Robert and Casella's 2004 Monte Carlo Statistical Methods (especially Chapter 3) are the standard graduate references in statistics; Bishop's Pattern Recognition and Machine Learning (2006) and Murphy's Machine Learning: A Probabilistic Perspective (2012) and Probabilistic Machine Learning (2022) cover the technique for ML audiences. Owen's online textbook Monte Carlo theory, methods and examples (2013, in progress) gives the most complete modern variance-reduction treatment, with Chapter 9 devoted entirely to importance sampling. The reinforcement-learning treatment is concentrated in Sutton and Barto's Reinforcement Learning: An Introduction (2018, 2nd edition), Chapters 5 and 7.