Resampling and Monte Carlo Methods


# Manipulate data
import numpy as np
import pandas as pd

# Statistics
import scipy.stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera

# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils

# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline

Monte Carlo Simulation of a Random Walk Process

One-dimensional random walk (Brownian motion)

More information: Random Walks, Central Limit Theorem

At each step \(i\) the process moves +1 or -1 with equal probability, i.e., \(X_i \in \{+1, -1\}\) with \(P(X_i=+1)=P(X_i=-1)=1/2\). The steps \(X_i\) are i.i.d.

Let \(S_n = \sum_{i=1}^n X_i\), or equivalently \(S_i = S_{i-1} + X_i\) at time \(i\), with \(S_0 = 0\). Since the steps are i.i.d. with \(E[X_i] = 0\) and \(\text{Var}[X_i] = 1\), we have \(E[S_n] = 0\) and \(SD(S_n) = \sqrt{n}\) (for \(n = 200\), \(\sqrt{200} \approx 14.14\)), which the simulation below should recover.

Realizations of random walks are obtained by Monte Carlo simulation. Below we simulate the trajectories, i.e., \(S_n\) for \(n = 0\) to \(200\), and plot a few of them.

np.random.seed(seed=42)  # make the example reproducible

n = 200        # trajectory depth (number of steps)
nsamp = 50000  # number of trajectories

# Xn: each row (axis 0) is one trajectory, each column (axis 1) one step
Xn = np.array([np.random.choice(a=np.array([-1, +1]), size=n,
                                replace=True, p=np.ones(2) / 2)
               for i in range(nsamp)])

# Terminal value S_n of each trajectory
Sn = Xn.sum(axis=1)

print("True Stat. Mean={:.03f}, Sd={:.02f}".\
    format(0, np.sqrt(n) * 1))

print("Est. Stat. Mean={:.03f}, Sd={:.02f}".\
    format(Sn.mean(), Sn.std()))
True Stat. Mean=0.000, Sd=14.14
Est. Stat. Mean=0.010, Sd=14.09
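For reference, the trajectory matrix can also be drawn in a single vectorized call, which is much faster than the list comprehension above (a sketch using NumPy's Generator API; the name rng is introduced here and not used elsewhere in this notebook):

rng = np.random.default_rng(42)  # Generator-based RNG, independent of np.random.seed
Xn = rng.choice(np.array([-1, +1]), size=(nsamp, n))  # one row per trajectory
Sn = Xn.sum(axis=1)  # terminal value of each walk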

Plot cumulative sum of 100 random walks (trajectories)

Sn_traj = Xn[:100, :].cumsum(axis=1)
_ = pd.DataFrame(Sn_traj.T).plot(legend=False)
[Figure: cumulative sums of 100 random-walk trajectories (../_images/stat_montecarlo_5_0.png)]

Distribution of \(S_n\) vs \(\mathcal{N}(0, \sqrt{n})\)

x_low, x_high = Sn.mean() - 3 * Sn.std(), Sn.mean() + 3 * Sn.std()
h_ = plt.hist(Sn, range=(x_low, x_high), density=True, bins=43, fill=False,
              label="Histogram")

x_range = np.linspace(x_low, x_high, 30)
prob_x_range = scipy.stats.norm.pdf(x_range, loc=Sn.mean(), scale=Sn.std())
plt.plot(x_range, prob_x_range, 'r-', label="PDF: P(X)")
_ = plt.legend()
[Figure: histogram of \(S_n\) with the normal PDF overlay (../_images/stat_montecarlo_7_0.png)]

Permutation Tests

Permutation test:

  • The test involves two or more samples, assuming that values can be randomly permuted under the null hypothesis.

  • The test is a resampling procedure that estimates the distribution of a parameter or of any statistic under the null hypothesis, i.e., calculated on the permuted data. This parameter or statistic is called the estimator.

  • Statistical inference is conducted by computing the proportion of permuted values of the estimator that are “more extreme” than the observed one, which provides an estimate of the p-value.

  • Permutation tests are a subset of non-parametric statistics, useful when the distribution of the estimator (under H0) is unknown or requires complicated formulas.

Permutation test procedure

  1. Estimate the observed parameter or statistic \(\hat{\theta} = S(X)\) on the initial dataset \(X\) of size \(N\). We call it the observed statistic.

  2. Generate \(R\) samples (called randomized samples) \([X_1, \ldots X_r, \ldots X_R]\) from the initial dataset by permutation of the values between the two samples.

  3. Distribution of the estimator under \(H0\): For each random sample \(r\), compute the estimator \(\hat{\theta}_r = S(X_r)\). The set of \(\{\hat{\theta}_r\}\) provides an estimate of the distribution \(P(\theta | H0)\) (under the null hypothesis).

  4. Compute statistics of the estimator under the null hypothesis using randomized estimates \(\hat{\theta}_r\)’s:

  • Mean (under \(H0\)):

    \[\bar{\theta}_R = \frac{1}{R}\sum_{r=1}^R \hat{\theta}_r\]
  • Standard Error \(\hat{SE}_{\theta_R}\) (under \(H0\)):

    \[\hat{SE}_{\theta_R} = \sqrt{\frac{1}{R-1}\sum_{r=1}^R (\bar{\theta}_R - \hat{\theta}_r)^2}\]
  • Compute p-value using the distribution (under \(H0\)):

    • One-sided p-value:

      \[P(\theta \ge \hat{\theta} | H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \ge \hat{\theta})}{R}\]
    • Two-sided p-value:

      \[P(\theta \le -|\hat{\theta}|~\text{or}~\theta \ge |\hat{\theta}|~|~H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \le -|\hat{\theta}|) + \mathbf{card}(\hat{\theta}_r \ge |\hat{\theta}|)}{R}\]
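The function below implements this procedure for a single sample of paired differences: permuting the values between the two paired samples is equivalent to randomly flipping the sign of each difference, which is how the randomized samples are generated.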
def randomize(x, estimator, n_perms=1000):
    """Sign-flip permutation test for a sample of paired differences.

    Parameters
    ----------
    x : array
        the dataset
    estimator : callable
        the estimator function taking x as argument and returning the
        estimator value (scalar)
    n_perms : int, optional
        the number of permutations, by default 1000

    Returns
    -------
    Observed estimate
    Mean of randomized estimates
    Standard Error of randomized estimates
    Two-sided p-value
    Randomized distribution estimates (density_values, bins)
    """

    # 1. Estimate the observed parameter or statistic
    theta = estimator(x)

    # 2. Permuted samples: randomly flip the sign of each observation
    x_R = [np.random.choice([-1, 1], size=len(x), replace=True) * x
           for perm_i in range(n_perms)]

    # 3. Distribution of the parameter or statistic under H0
    thetas_R = np.array([estimator(x_r) for x_r in x_R])
    theta_density_R, bins = np.histogram(thetas_R, bins=50, density=True)

    # 4. Statistics of the randomized estimates

    # Mean of randomized estimates
    theta_mean_R = np.mean(thetas_R)

    # Standard Error of randomized estimates
    theta_se_R = np.sqrt(1 / (n_perms - 1) *
                         np.sum((theta_mean_R - thetas_R) ** 2))

    # Two-sided p-value: proportion of randomized estimates at least as
    # extreme (in absolute value) as the observed one
    extreme_vals = (thetas_R <= -np.abs(theta)) | (thetas_R >= np.abs(theta))
    pval = np.sum(extreme_vals) / n_perms

    return theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins)

Example: we load the monthly revenue figures of 100 stores before and after a marketing campaign (the code below keeps only the first 30 stores). We compute the difference (\(x_i = x_i^\text{after} - x_i^\text{before}\)) for each store \(i\). Under the null hypothesis, i.e., no effect of the campaign, \(x_i^\text{after}\) and \(x_i^\text{before}\) could be permuted, which is equivalent to randomly switching the sign of \(x_i\). Here we focus on the sample mean \(\hat{\theta} = S(X) = 1/n\sum_i x_i\) as the statistic of interest.

df = pd.read_csv("../datasets/Monthly Revenue (in thousands).csv")
# Reshape, and keep only the first 30 stores
df = df.pivot(index='store_id', columns='time', values='revenue')[:30]
df.after -= 3  # force a smaller effect size
x = df.after - df.before

plt.hist(x, fill=False)
plt.axvline(x=0, color="b", label=r'No effect: 0')
plt.axvline(x=x.mean(), color="r", ls='-', label=r'$\bar{x}=%.2f$' % x.mean())
plt.legend()
_ = plt.title(r'Distribution of the sales changes'
              r' $x_i = x_i^\text{after} - x_i^\text{before}$')
[Figure: distribution of the sales changes \(x_i\) (../_images/stat_montecarlo_12_0.png)]
np.random.seed(15) # set the seed for reproducible results

theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
    randomize(x=x, estimator=np.mean, n_perms=10000)

print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
    format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 2.26, Mean(Estimate|HO): -0.0103, p-val: 2.430000e-02, SE: 1.018

Plot

pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals=bins[1:], stat_probs=theta_density_R,
                     stat_obs=theta,  stat_h0=0, bar_width=np.diff(bins),
                     thresh_low=-np.abs(theta), thresh_high=np.abs(theta))
_ = plt.title(r'Randomized distribution of the estimator $\hat{\theta}_b$ (sample mean)')
[Figure: randomized distribution of the estimator (sample mean) (../_images/stat_montecarlo_15_0.png)]

A similar procedure can be conducted with many statistics, e.g., the t-statistic (which gives the same p-value):

def ttstat(x):
    """One-sample t-statistic: mean / (SD / sqrt(n)), tested against 0."""
    return (np.mean(x) - 0) / np.std(x, ddof=1) * np.sqrt(len(x))


np.random.seed(15) # set the seed for reproducible results
theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
    randomize(x=x, estimator=ttstat, n_perms=10000)

print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
    format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 2.36, Mean(Estimate|HO): -0.0106, p-val: 2.430000e-02, SE: 1.025
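For comparison, the parametric one-sample t-test can serve as a cross-check here. Its p-value relies on a normality assumption rather than on permutations, so it should be close to, but not identical with, the permutation p-value:

tval, pval_param = scipy.stats.ttest_1samp(x, popmean=0)  # test the mean against 0
print("t: {:.2f}, parametric p-val: {:.4f}".format(tval, pval_param))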

Or the median:

np.random.seed(15) # set the seed for reproducible results
theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
    randomize(x=x, estimator=np.median, n_perms=10000)

print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
    format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 1.85, Mean(Estimate|HO): -0.0144, p-val: 9.370000e-02, SE: 1.066
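Recent SciPy versions (1.8+) ship this sign-flip scheme as scipy.stats.permutation_test; a minimal sketch, assuming that API is available (the numbers will differ slightly from ours since SciPy draws its own permutations):

from scipy import stats

# One-sample sign-flip permutation test of the mean (permutation_type='samples')
res = stats.permutation_test((x,), np.mean, permutation_type='samples',
                             n_resamples=10000, alternative='two-sided',
                             random_state=15)
print("Statistic: {:.2f}, p-val: {:.4f}".format(res.statistic, res.pvalue))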

Bootstrapping

Bootstrapping:

  • Resampling procedure to estimate the distribution of a statistic or parameter of interest, called the estimator.

  • Derive estimates of variability for the estimator (bias, standard error, confidence intervals, etc.).

  • Statistical inference is conducted by checking whether the confidence interval (CI) contains the value expected under the null hypothesis.

  • A nonparametric approach to statistical inference, useful when model assumptions are in doubt, unknown, or require complicated formulas.

  • Bootstrapping with replacement has favorable performance (Efron 1979 and 1986) compared to prior methods like the jackknife, which samples without replacement (see the sketch after this list).

  • Regularize models by fitting several models on bootstrap samples and averaging their predictions (see bagging and random forests in the machine learning chapter).
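For contrast with the jackknife mentioned above, here is a minimal leave-one-out sketch of its standard-error estimate (the function name jackknife_se is ours, not a library routine):

def jackknife_se(x, estimator=np.mean):
    """Leave-one-out jackknife standard error of an estimator (sketch)."""
    x = np.asarray(x)
    n = len(x)
    # Estimator computed on each subsample with one observation removed
    theta_loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    return np.sqrt((n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2))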

Note that, compared to random permutations, bootstrapping samples the distribution under the alternative hypothesis; it does not consider the distribution under the null hypothesis. A great advantage of the bootstrap is the simplicity of the procedure:

Bootstrapping procedure

  1. Compute the estimator \(\hat{\theta} = S(X)\) on the initial dataset \(X\) of size \(N\).

  2. Generate \(B\) samples (called bootstrap samples) \([X_1, \ldots X_b, \ldots X_B]\) from the initial dataset by randomly drawing with replacement \(N\) observations.

  3. For each sample \(b\) compute the estimator \(\hat{\theta}_b = S(X_b)\), which provides the bootstrap distribution \(P_{\theta_B}\) of the estimator.

  4. Compute statistics of the estimator on bootstrap estimates \(\hat{\theta}_b\)’s:

  • Bootstrap estimate (of the parameters):

    \[\bar{\theta}_B = \frac{1}{B}\sum_{b=1}^B \hat{\theta}_b\]
  • Bias = bootstrap estimate - estimate:

    \[\hat{b}_{\theta_B} = \bar{\theta}_B - \hat{\theta}\]
  • Standard error \(\hat{S}_{\theta_B}\):

    \[\hat{S}_{\theta_B} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B (\bar{\theta}_B - \hat{\theta}_b)^2}\]
  • Confidence interval using the estimated bootstrapped distribution of the estimator:

    \[CI_{95\%}=[\hat{\theta}_1=Q(2.5\%), \hat{\theta}_2=Q(97.5\%)],~\text{i.e., the 2.5\% and 97.5\% quantiles of the}~\hat{\theta}_b\text{'s}\]

Application using the monthly revenue of the stores before and after the marketing campaign, with the difference (\(x_i = x_i^\text{after} - x_i^\text{before}\)) for each store \(i\). If the average difference \(\bar{x}=1/n \sum_i x_i\) is positive (resp. negative), then the marketing campaign will be considered efficient (resp. detrimental). We will use bootstrapping to compute the confidence interval (CI) and check whether 0 lies within the CI.

x = df.after - df.before
S = np.mean  # estimator: the sample mean

# 1. Observed estimate
theta_hat = S(x)

np.random.seed(15)  # set the seed for reproducible results
B = 1000  # number of bootstrap samples

# 2. Bootstrapped samples: draw N observations with replacement
x_B = [np.random.choice(x, size=len(x), replace=True) for boot_i in range(B)]

# 3. Bootstrap estimates and distribution
theta_hats_B = np.array([S(x_b) for x_b in x_B])
theta_density_B, bins = np.histogram(theta_hats_B, bins=50, density=True)
dx = np.diff(bins)

# 4. Bootstrap statistics

# Bootstrap estimate
theta_bar_B = np.mean(theta_hats_B)

# Bias = bootstrap estimate - estimate
bias_hat_B = theta_bar_B - theta_hat

# Standard error
se_hat_B = np.sqrt(1 / (B - 1) * np.sum((theta_bar_B - theta_hats_B) ** 2))

# Confidence interval from the quantiles of the bootstrap distribution
ci95 = np.quantile(a=theta_hats_B, q=[0.025, 0.975])

print(
    "Est.: {:.2f}, Boot Est.: {:.2f}, bias: {:e},"
    " Boot SE: {:.2f}, CI: [{:.5f}, {:.5f}]"
    .format(theta_hat, theta_bar_B, bias_hat_B, se_hat_B, ci95[0], ci95[1]))
Est.: 2.26, Boot Est.: 2.19, bias: -7.201946e-02, Boot SE: 0.95, CI: [0.45256, 4.08465]

Conclusion: zero lies outside the CI and, moreover, \(\bar{x}\) is positive. Thus we can conclude that the marketing campaign produced a significant increase in sales.
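The decision rule stated in the bullet list above (reject \(H0\) when the CI excludes the null value) can also be checked programmatically, reusing ci95 from the code above:

# H0 value (no effect) is 0; reject H0 when 0 falls outside the 95% CI
print("Null value inside CI:", ci95[0] <= 0 <= ci95[1])  # False here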

Plot

plt.bar(bins[1:], theta_density_B, width=dx, fill=False, label=r'$P_{\theta_B}$')
plt.axvline(x=theta_hat, color='r', ls='--', lw=2, label=r'$\hat{\theta}$ (Obs. Stat.)')
plt.axvline(x=theta_bar_B, color="b", ls='-', label=r'$\bar{\theta}_B$')
plt.axvline(x=ci95[0], color="b", ls='--', label=r'$CI_{95\%}$')
plt.axvline(x=ci95[1], color="b", ls='--')
plt.axvline(x=0, color="k", lw=2, label=r'No effect: 0')
plt.legend()
_ = plt.title(r'Bootstrap distribution of the estimator $\hat{\theta}_b$ (sample mean)')
[Figure: bootstrap distribution of the estimator (sample mean) (../_images/stat_montecarlo_25_0.png)]
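SciPy (1.7+) also bundles this procedure as scipy.stats.bootstrap; a minimal sketch, assuming that API is available, which should closely reproduce the percentile CI computed manually above:

from scipy import stats

# Percentile-method bootstrap CI of the sample mean
res = stats.bootstrap((x,), np.mean, n_resamples=1000, method='percentile',
                      confidence_level=0.95, random_state=15)
print("CI: [{:.5f}, {:.5f}], SE: {:.2f}".format(
    res.confidence_interval.low, res.confidence_interval.high,
    res.standard_error))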