12  Variance Reduction

Published April 1, 2025

12.0.1 Markov Chain Monte Carlo (MCMC)

When the random variables \(X_i\) are not independent, the variance of the sum is given by

\[ \text{Var}(\hat{\ell}) = \frac{1}{N^2} \sum_{i=1}^{N} \text{Var}(f(X_i)) + \frac{2}{N^2} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \text{Cov}(f(X_i), f(X_j)). \]

There are \(N(N-1)/2\) terms in the covariance sum, and in general we cannot guarantee that the covariances are small, so the confidence interval derived above for i.i.d. samples is no longer valid. In the case of ergodic Markov chains, the central limit theorem for Markov chains allows us to estimate the CI as \[ \left[ \hat{\ell} - z_{1-\alpha/2} \frac{S_{eff}}{\sqrt{N_{eff}}}, \hat{\ell} + z_{1-\alpha/2} \frac{S_{eff}}{\sqrt{N_{eff}}} \right], \] where \(N_{eff}\) is the effective sample size. We will not discuss the details of effective sample size here.

Example: Estimating \(\pi\). In our first example of estimating \(\pi\), we used the Monte Carlo method: we simulated random points in a square and counted the number of points that fall within a circle. The estimator we used was

\[ \hat{\ell} = \frac{4}{N} \sum_{i=1}^{N} I(X_i), \]

where \(I(X_i)\) is an indicator function that is 1 if the point \(X_i\) is inside the circle and 0 otherwise. The variance of this estimator is given by \[ \text{Var}(\hat{\ell}) = \frac{16}{N} \left( \frac{\pi}{4} \left( 1 - \frac{\pi}{4} \right) \right). \]
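
As a quick sanity check, here is a minimal sketch (in Python, using NumPy) of this estimator. It samples points uniformly in the unit square and uses the quarter circle of radius 1, which gives the same estimator with success probability \(\pi/4\); the sample size is an arbitrary illustration choice.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
# Sample N points uniformly in the unit square [0, 1]^2.
x = rng.uniform(size=N)
y = rng.uniform(size=N)

# Indicator: 1 if the point falls inside the quarter circle of radius 1,
# which has area pi/4.
inside = (x**2 + y**2 <= 1.0).astype(float)

pi_hat = 4.0 * inside.mean()

# Empirical variance of the estimator vs. the formula 16/N * (pi/4)(1 - pi/4).
var_empirical = 16.0 / N * inside.var(ddof=1)
var_formula = 16.0 / N * (np.pi / 4.0) * (1.0 - np.pi / 4.0)

print(f"pi_hat = {pi_hat:.5f}")
print(f"Var(pi_hat): empirical = {var_empirical:.3e}, formula = {var_formula:.3e}")
```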

Example: Estimation of Rare-Event Probabilities. Consider estimation of the tail probability \(\ell = P(X > \gamma)\) of some random variable \(X\) for a large number \(\gamma\). We can use the following estimator: \[ \hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{> \gamma}(X_i), \] where \(I_{> \gamma}(X_i)\) is an indicator function that is 1 if \(X_i > \gamma\) and 0 otherwise. The variance of this estimator is given by

\[ \text{Var}(\hat{\ell}) = \frac{1}{N} \left( \ell (1 - \ell) \right). \] The relative width of the confidence interval is given by \[ \text{Relative Width} = \frac{2z_{1-\alpha/2} \sqrt{\ell (1 - \ell)}}{\hat{\ell} \sqrt{N}} \approx \frac{2z_{1-\alpha/2}}{\sqrt{N}} \sqrt{\frac{1 - \ell}{\ell}} \approx \frac{2z_{1-\alpha/2}}{\sqrt{N \ell}} \] When \(\ell\) is small, the relative width of the confidence interval is large. This means that we need a large number of samples to get a good estimate of \(\ell\).
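
The sketch below illustrates this difficulty for a standard normal \(X\) with an (assumed, for illustration) threshold \(\gamma = 4\): with \(N = 10^5\) crude samples only a handful fall in the tail, and the relative width of the 95% confidence interval is of order one or larger, in line with the \(2 z_{1-\alpha/2}/\sqrt{N\ell}\) approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

gamma = 4.0                        # tail threshold (assumed for illustration)
ell_true = stats.norm.sf(gamma)    # P(X > gamma) for X ~ N(0, 1), about 3.2e-05

N = 100_000
x = rng.standard_normal(N)
hits = (x > gamma).astype(float)
ell_hat = hits.mean()

z = stats.norm.ppf(0.975)          # z_{1 - alpha/2} for a 95% CI

# Relative width of the CI, compared with the 2 z / sqrt(N * ell) approximation.
if ell_hat > 0:
    rel_width = 2.0 * z * np.sqrt(ell_hat * (1.0 - ell_hat) / N) / ell_hat
else:
    rel_width = np.inf             # no hits at all: the crude estimator is useless
print(f"ell_hat = {ell_hat:.2e}   (true value {ell_true:.2e})")
print(f"relative CI width = {rel_width:.2f}, "
      f"approximation 2z/sqrt(N*ell) = {2.0 * z / np.sqrt(N * ell_true):.2f}")
```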

Example: Magnetization in a 2D Ising Model. In a previous worksheet, we used Gibbs sampling and Metropolis–Hastings sampling to estimate the magnetization of a 2D Ising model. We used the simple estimator \[ \hat{M} = \frac{1}{N} \sum_{i=1}^{N} M(\sigma_i), \] where \(M(\sigma_i)\) is the magnetization of the \(i\)-th sample.

The samples \(\sigma_i\) are not independent, and so the variance of the estimator will be

\[ \text{Var}(\hat{M}) = \frac{1}{N^2} \sum_{i=1}^{N} \text{Var}(M(\sigma_i)) + \frac{2}{N^2} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \text{Cov}(M(\sigma_i), M(\sigma_j)), \] which leads to much larger confidence intervals than we would expect from independent samples.
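
The Ising sampler itself is not reproduced here, but the following toy sketch (an AR(1) chain standing in, purely for illustration, for correlated MCMC output) shows the effect: the variance of the sample mean of a positively correlated chain is far larger than the \(1/N\) we would get from independent samples. The chain length, correlation \(\rho\), and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for correlated MCMC output: an AR(1) chain with correlation rho
# and stationary variance 1.  We repeat the experiment many times to measure
# the variance of the sample mean directly.
rho, N, n_reps = 0.95, 2_000, 200
means_corr, means_iid = [], []
for _ in range(n_reps):
    eps = rng.standard_normal(N)
    x = np.empty(N)
    x[0] = eps[0]
    for t in range(1, N):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho**2) * eps[t]
    means_corr.append(x.mean())
    means_iid.append(rng.standard_normal(N).mean())

# Positive autocorrelation inflates the variance of the sample mean well above 1/N.
print(f"Var of sample mean, correlated chain: {np.var(means_corr):.2e}")
print(f"Var of sample mean, i.i.d. samples  : {np.var(means_iid):.2e} (1/N = {1.0/N:.2e})")
```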

12.1 Importance Sampling

Importance sampling is a method to reduce the variance of an estimator by changing the distribution from which we sample: we draw samples from a proposal distribution that places more mass on the regions that matter most for the quantity of interest, and then reweight the samples to correct for the change of distribution.

For example, when estimating the tail probability \(\ell = P(X > \gamma)\), if \(\ell\) is small, then most of the samples will be in the region \(X \leq \gamma\), which contributes little to the estimate. However, we cannot simply sample from the tail and average, since that ignores how likely the tail is under the original distribution. Importance sampling allows us to sample from the tail and then “fix” the estimate by weighting the samples appropriately.

Definition: Importance Sampling. Let \(X\) be a random variable with probability density function (pdf) \(p(x)\), and let \(q(x)\) be a proposal pdf such that \(q(x) > 0\) for all \(x\) in the support of \(p(x)\). The importance sampling estimator of \(\ell = E[f(X)]\) is given by \[ \hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} f(X_i) \frac{p(X_i)}{q(X_i)}, \] where \(X_1, X_2, \ldots, X_N\) are i.i.d. samples from the distribution with pdf \(q(x)\).
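
A minimal sketch of this estimator in Python, using SciPy densities. The target \(p\), the function \(f\), and the proposal \(q\) below are illustrative choices, not from the notes: here \(p = N(0,1)\), \(f(x) = x^2\), and \(q = N(0, 2^2)\), so the exact answer is \(\mathbb{E}_p[X^2] = 1\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A minimal importance sampling estimator of E_p[f(X)].  The target p, the
# function f, and the proposal q are illustrative choices.
def f(x):
    return x**2

p = stats.norm(loc=0.0, scale=1.0)   # target density p(x): standard normal
q = stats.norm(loc=0.0, scale=2.0)   # proposal density q(x): wider normal,
                                     # positive wherever f(x) p(x) is nonzero

N = 50_000
x = q.rvs(size=N, random_state=rng)  # i.i.d. samples from q, NOT from p
weights = p.pdf(x) / q.pdf(x)        # importance weights p(x) / q(x)
ell_hat = np.mean(f(x) * weights)

print(f"IS estimate of E_p[X^2] = {ell_hat:.4f}   (exact value: 1.0)")
```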

For clarity, we’ll denote the crude Monte Carlo estimator introduced earlier as \(\hat{\ell}_{crude}\) and the importance sampling estimator as \(\hat{\ell}_{IS}\).

Suppose \(N = 1\) so that the estimator is given by \[ \hat{\ell}_{IS} = f(X) \frac{p(X)}{q(X)}. \] Note that here \(X \sim q(x)\) and NOT \(p(x)\). We can check that this estimator is unbiased: \[ \begin{aligned} E_q[\hat{\ell}_{IS}] &= E_q\left[f(X) \frac{p(X)}{q(X)}\right] \\ &= \int f(x) \frac{p(x)}{q(x)} q(x) dx \\ &= \int f(x) p(x) dx \\ &= E[f(X)] \\ &= \ell. \end{aligned} \]

However, in general, \[ \text{Var}(\hat{\ell}_{IS}) \neq \text{Var}(\hat{\ell}_{crude}). \]

This allows us to reduce the variance of the estimator by choosing \(q(x)\) appropriately.

Suppose \(f(X)\) is a non-negative function. Then if we choose

\[ q(x) \propto p(x) f(x), \]

then the importance sampling estimator for \(N = 1\)

\[ \hat{\ell}_{IS} = f(X) \frac{p(X)}{q(X)} \]

is equal to the constant \(\ell\) (the normalizing constant of \(q\) is \(\int p(x) f(x)\, dx = \ell\), so \(f(x) p(x)/q(x) = \ell\) for every \(x\)) and hence has zero variance! When \(f\) is not non-negative, we can show that

\[ q(x) \propto p(x) |f(x)| \]

minimizes the variance of the estimator \(\hat{\ell}_{IS}\).

However, note that the normalizing constant of this optimal \(q(x)\) is \(\int p(x) |f(x)|\, dx\), which (for non-negative \(f\)) is exactly the quantity \(\ell\) we are trying to estimate, so we cannot construct this \(q(x)\) in advance. Even if we could, we might not be able to sample from it easily. In practice, we choose \(q(x)\) to be a distribution that is easy to sample from and that is “close” to \(p(x)|f(x)|\) in some sense.

Example: Importance Sampling for Rare Events. Consider the problem of estimating the tail probability \(\ell = P(X > \gamma)\) for a random variable \(X\) with a standard normal distribution. We can use importance sampling to estimate this probability by choosing a proposal distribution \(q(x)\) that is concentrated in the tail region. One such choice is an exponential distribution with rate \(\lambda\), shifted to start at \(\gamma = 2\), which has pdf \[ q(x) = \lambda e^{-\lambda (x-2)}, \quad x \ge 2. \] Plots of the running averages of the crude and importance sampling estimators for \(N = 2000\) samples show that the importance sampling estimator is much more stable and converges to the true value of \(\ell\) much faster than the crude estimator.

Variance of Importance Sampling Estimate: 0.00000
Relative Error of Importance Sampling Estimate: 0.01247
Variance of Crude Monte Carlo Estimate: 0.00001
Relative Error of Crude Monte Carlo Estimate: 0.09231
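
A sketch of this example is below. The proposal is the shifted exponential above with \(\gamma = 2\) and \(\lambda = 1\) (assumed values); the numbers it prints will differ from the run reported above, but the qualitative comparison is the same: for the same \(N\), the importance sampling estimator has far smaller variance than the crude one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# ell = P(X > gamma) for X ~ N(0, 1); gamma = 2 and lam = 1 are assumed values.
gamma, lam, N = 2.0, 1.0, 2_000
ell_true = stats.norm.sf(gamma)

# Crude Monte Carlo: indicator of {X > gamma} with X ~ N(0, 1).
crude = (rng.standard_normal(N) > gamma).astype(float)

# Importance sampling: sample X = gamma + Exp(lam), so every sample lies in the
# tail, and weight by p(x) / q(x) with q(x) = lam * exp(-lam * (x - gamma)).
x_is = gamma + rng.exponential(scale=1.0 / lam, size=N)
is_samples = stats.norm.pdf(x_is) / (lam * np.exp(-lam * (x_is - gamma)))

for name, s in [("crude MC", crude), ("importance sampling", is_samples)]:
    est, var = s.mean(), s.var(ddof=1) / N
    print(f"{name:20s}: estimate = {est:.3e} (true {ell_true:.3e}), variance = {var:.3e}")
```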

12.1.1 Remarks

  1. The optimal choice of the proposal distribution \(q(x)\) is not always easy to find. Even if we can find it, we may not be able to sample from it easily, and the importance sampling algorithm requires that we can sample from \(q(x)\). Often, we use a distribution that is easy to sample from and that is “close” to \(|f(x)|p(x)\) in some sense.
  2. Unlike rejection sampling and MCMC methods, importance sampling requires knowing the normalizing constant of the proposal distribution \(q(x)\) in order to compute the weights. This limits the choice of \(q(x)\); some common choices are the exponential, normal, and uniform distributions, or mixtures of these.
  3. In order for the estimator to be well-defined, we need to ensure that \(q(x) > 0\) for all \(x\) in the support of \(f(x) p(x)\). As in the case of rare-event estimation, the support of \(f(x) p(x)\) may be very small compared to the support of \(p(x)\). We only need \(q(x)\) to be positive in this smaller region.

12.2 Antithetic and Control Random Variates

In this section, we will discuss two methods of variance reduction based on the idea of using correlated random variables: antithetic variates and control variates. Recall that the variance of a sum of two random variables is given by

\[ \text{var}(X + Y) = \text{var}(X) + \text{var}(Y) + 2\text{cov}(X, Y). \]

Oftentimes, having correlated random variables is undesirable, as it leads to an increase in variance and a decrease in the effective sample size. However, in some cases, we can use this correlation to our advantage.

12.2.1 Antithetic Variates

Antithetic variates are pairs of random variables that are negatively correlated. If we can build an estimator from sums of two antithetic random variables, we can reduce its variance.

Consider the example of estimating an integral \(\ell = \int_a^b f(x)\,dx\) from the earlier section on estimating integrals. Suppose \(f(x)\) is a monotonic function over \([a, b]\). Then you’ll show on the homework that if \(X \sim U(a, b)\) then

\[ \text{cov}(f(X), f(a + b - X)) \le 0. \]

We can see intuitively why this happens: if \(f(x)\) is increasing then \(f(a + b - x)\) is decreasing, and vice versa, and hence the two are negatively correlated. In this case, we can reduce the variance of the crude estimator by instead using

\[ \hat{\ell}_{anti} = \dfrac{(b - a)}{2N} \sum \limits_{i = 1}^{N} \left(f(X_i) + f(a + b - X_i)\right). \]

Note that if \(X \sim U(a, b)\) then so is \(a + b - X\), and so \(\hat{\ell}_{anti}\) is an unbiased estimator.

Theorem 12.1 \(\text{var}(\hat{\ell}_{anti}) \le \text{var}(\hat{\ell}_{crude})\).

Example 12.1 (Antithetic Variates). Consider the problem of estimating the integral \(\ell = \int_0^1 (1 + x^2)^{-1} dx\) using antithetic variates. Plots of the running averages of the crude and antithetic variate estimators for \(N = 100\) samples show that the antithetic variate estimator achieves roughly a \(60\times\) reduction in variance compared to the crude estimator (see the output below).

Standard Monte Carlo Estimate: 0.770064, Variance: 1.180981e-04
Antithetic Variates Estimate: 0.787292, Variance: 1.917273e-06
Variance Reduction Factor: 61.60x
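
A minimal sketch of this example: the crude estimator uses \(2N\) uniform samples and the antithetic estimator uses \(N\) pairs \((U_i, 1 - U_i)\), so both use the same number of function evaluations. The exact values printed will vary from run to run.

```python
import numpy as np

rng = np.random.default_rng(5)

# Estimate the integral of 1 / (1 + x^2) over [0, 1], whose exact value is pi/4.
def f(x):
    return 1.0 / (1.0 + x**2)

N = 100

# Crude estimator with 2N uniform samples (same budget of function evaluations).
u = rng.uniform(size=2 * N)
crude = f(u).mean()

# Antithetic estimator: N pairs (U_i, 1 - U_i), averaged pairwise.
u = rng.uniform(size=N)
anti = ((f(u) + f(1.0 - u)) / 2.0).mean()

print(f"crude = {crude:.6f}, antithetic = {anti:.6f}, exact = {np.pi / 4.0:.6f}")
```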

12.2.2 Control Variates

Control variates exploit positive correlation. If \(X\) and \(Y\) are positively correlated random variables, then the difference

\[ \text{var}(X - Y) = \text{var}(X) + \text{var}(Y) - 2 \text{cov}(X, Y) \]

will have lower variance than \(\text{var}(X)\) whenever \(2\,\text{cov}(X, Y) > \text{var}(Y)\). Let \(X \sim p(x)\). Suppose we want to estimate \(\ell = \mathbb{E}[f(X)]\) for some function \(f(x)\). We can use a control variate \(Y = h(X)\) for some function \(h(x)\) such that

  1. \(\mathbb{E}[h(X)]\) is known, say \(\mathbb{E}[h(X)] = h_0\).
  2. \(h(x)\) is strongly positively correlated with \(f(x)\), i.e., \(\text{cov}(f(X), h(X)) \gg 0\).

Then we can use the control variate estimator

\[ \hat{\ell}_\text{CV} = \frac{1}{n} \sum_{i=1}^n \left[f(X_i) - \beta (h(X_i) - h_0)\right] \]

where \(\beta\) is a constant. It is easy to see that \(\hat{\ell}_\text{CV}\) is an unbiased estimator of \(\ell\). The variance of the control variate estimator is given by

\[ \begin{aligned} \text{var}(\hat{\ell}_\text{CV}) &= \frac{1}{n} \left[\text{var}(f(X)) + \beta^2 \text{var}(h(X)) - 2 \beta \text{cov}(f(X), h(X))\right] \\ &= \frac{1}{n} \left[\text{var}(f(X)) + \beta \text{var}(h(X)) \left[ \beta - 2\frac{\text{cov}(f(X), h(X))}{\text{var}(h(X))}\right]\right] \end{aligned} \]

By choosing \(0 < \beta < 2\frac{\text{cov}(f(X), h(X))}{\text{var}(h(X))}\), we can reduce the variance of the control variate estimator; the variance is minimized at \(\beta^* = \frac{\text{cov}(f(X), h(X))}{\text{var}(h(X))}\). In practice, it is not easy to calculate the covariance between \(f(X)\) and \(h(X)\), so we

  1. Pick a control variate \(h(X)\) that is strongly correlated with \(f(X)\), and whose expectation is known.
  2. Experiment with different values of \(\beta\) to find the one that minimizes the variance of the control variate estimator.

Example 12.2 In the example below, we estimate the integral of \(x e^{-x}\) over the interval \([0, 1]\) using the control variate \(h(x) = x\). We know that \(\mathbb{E}[h(X)] = \frac{1}{2}\), so the estimator is given by \[ \hat{\ell}_\text{CV} = \frac{1}{n} \sum_{i=1}^n \left[f(X_i) - \beta \left(h(X_i) - \tfrac{1}{2}\right)\right] \] where \(\beta\) is a constant and \(X_i \sim \text{Uniform}(0, 1)\) are i.i.d. We choose \(\beta \approx 0.35\), which approximately minimizes the variance of the control variate estimator. This results in an 8-fold reduction in variance compared to the naive estimator.
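
A minimal sketch of this example is below, with \(\beta = 0.35\) hard-coded as in the text; in practice one could also estimate \(\beta^*\) from a pilot sample. The sample size is an arbitrary illustration choice.

```python
import numpy as np

rng = np.random.default_rng(6)

# Estimate the integral of x * exp(-x) over [0, 1] (exact value 1 - 2/e) using
# the control variate h(x) = x, whose mean under Uniform(0, 1) is 1/2.
def f(x):
    return x * np.exp(-x)

def h(x):
    return x

n, beta = 10_000, 0.35          # beta ~ 0.35, as chosen in the example above

x = rng.uniform(size=n)
naive = f(x)
cv = f(x) - beta * (h(x) - 0.5)

print(f"naive estimate = {naive.mean():.5f}, variance = {naive.var(ddof=1) / n:.2e}")
print(f"CV estimate    = {cv.mean():.5f}, variance = {cv.var(ddof=1) / n:.2e}")
print(f"exact value    = {1.0 - 2.0 / np.e:.5f}")
```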