Resampling and Monte Carlo Methods¶
# Manipulate data
import numpy as np
import pandas as pd
# Statistics
import scipy.stats
import statsmodels.api as sm
#import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
Monte-Carlo simulation of Random Walk Process¶
One-dimensional random walk (Brownian motion)¶
More information: Random Walks, Central Limit Theorem
At each step \(i\) the process moves by +1 or -1 with equal probability, i.e., \(X_i \in \{+1, -1\}\) with \(P(X_i=+1)=P(X_i=-1)=1/2\). The steps \(X_i\) are i.i.d.
Let \(S_n = \sum_{i=1}^n X_i\); equivalently, the position at time \(i\) is \(S_i = S_{i-1} + X_i\).
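In code, this recursion is just a cumulative sum of the steps; a minimal sketch for a single trajectory (illustrative names, using the NumPy import above):
steps = np.random.choice([-1, 1], size=10)  # steps X_1, ..., X_10
positions = steps.cumsum()                  # positions S_1, ..., S_10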
Realizations of random walks are obtained by Monte Carlo simulation: we draw many trajectories, i.e., \(S_n\) for \(n=0\) to \(200\), and plot a few of them below.
np.random.seed(seed=42) # make the example reproducible
n = 200 # trajectory depth
nsamp = 50000  # number of trajectories
# Xn: each row (axis 0) contains one trajectory of n steps (axis 1)
Xn = np.array([np.random.choice(a=np.array([-1, +1]), size=n,
                                replace=True, p=np.ones(2) / 2)
               for i in range(nsamp)])
# Sum of random walks (trajectories)
Sn = Xn.sum(axis=1)
print("True Stat. Mean={:.03f}, Sd={:.02f}".\
format(0, np.sqrt(n) * 1))
print("Est. Stat. Mean={:.03f}, Sd={:.02f}".\
format(Sn.mean(), Sn.std()))
True Stat. Mean=0.000, Sd=14.14
Est. Stat. Mean=0.010, Sd=14.09
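These "true" values follow from the moments of the steps: \(E[X_i] = 0\) and \(\operatorname{Var}(X_i) = E[X_i^2] = 1\), so by linearity of expectation and independence of the \(X_i\)'s:
\[E[S_n] = \sum_{i=1}^n E[X_i] = 0, \qquad \operatorname{Var}(S_n) = \sum_{i=1}^n \operatorname{Var}(X_i) = n\]
hence \(\text{Sd}(S_n) = \sqrt{n} = \sqrt{200} \approx 14.14\), matching the estimates above.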
Plot the cumulative sum of 100 random walks (trajectories):
Sn_traj = Xn[:100, :].cumsum(axis=1)
_ = pd.DataFrame(Sn_traj.T).plot(legend=False)

Distribution of \(S_n\) vs. \(\mathcal{N}(0, \sqrt{n})\):
x_low, x_high = Sn.mean()-3*Sn.std(), Sn.mean()+3*Sn.std()
h_ = plt.hist(Sn, range=(x_low, x_high), density=True, bins=43, fill=False,
label="Histogram")
x_range = np.linspace(x_low, x_high, 30)
prob_x_range = scipy.stats.norm.pdf(x_range, loc=Sn.mean(), scale=Sn.std())
plt.plot(x_range, prob_x_range, 'r-', label="PDF: P(X)")
_ = plt.legend()

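Beyond the visual check, the closeness to the normal can be quantified on higher moments; a quick sketch using SciPy (a normal sample has skewness 0 and excess kurtosis 0, so both values should be close to zero here):
print("Skewness: {:.3f}, Excess kurtosis: {:.3f}".format(
    scipy.stats.skew(Sn), scipy.stats.kurtosis(Sn)))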
Permutation Tests¶
The test involves two or more samples, assuming that values can be randomly permuted under the null hypothesis.
A permutation test is a resampling procedure that estimates the distribution of a parameter or of any statistic under the null hypothesis, i.e., computed on the permuted data. This parameter or statistic is called the estimator.
Statistical inference is conducted by computing the proportion of permuted values of the estimator that are "more extreme" than the observed one, providing an estimate of the p-value.
Permutation tests are a subset of non-parametric statistics, useful when the distribution of the estimator (under \(H0\)) is unknown or requires complicated formulas.
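To illustrate the general idea on two samples, here is a minimal sketch of a permutation test of the difference in means, where the pooled values are shuffled and re-split under the null hypothesis (synthetic data; names are illustrative):
def perm_test_two_sample(a, b, n_perms=10000, seed=42):
    rng = np.random.default_rng(seed)
    theta = a.mean() - b.mean()  # observed statistic
    pooled = np.concatenate([a, b])
    thetas_r = np.empty(n_perms)
    for r in range(n_perms):
        perm = rng.permutation(pooled)  # random relabelling under H0
        thetas_r[r] = perm[:len(a)].mean() - perm[len(a):].mean()
    pval = np.mean(np.abs(thetas_r) >= np.abs(theta))  # two-sided p-value
    return theta, pval

a = np.random.normal(loc=1, scale=1, size=30)
b = np.random.normal(loc=0, scale=1, size=30)
theta, pval = perm_test_two_sample(a, b)
print("Diff. of means: {:.2f}, p-value: {:.4f}".format(theta, pval))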
Permutation test procedure

Permutation test procedure¶
Estimate the observed parameter or statistic \(\hat{\theta} = S(X)\) on the initial dataset \(X\) of size \(N\). We call it the observed statistic.
Generate \(R\) samples (called randomized samples) \([X_1, \ldots X_r, \ldots X_R]\) from the initial dataset by permutation of the values between the two samples.
Distribution of the estimator under \(H0\): for each randomized sample \(r\), compute the estimator \(\hat{\theta}_r = S(X_r)\). The set of \(\{\hat{\theta}_r\}\) provides an estimate of the distribution \(P(\theta | H0)\) (under the null hypothesis).
Compute statistics of the estimator under the null hypothesis using the randomized estimates \(\hat{\theta}_r\)'s:
Mean (under \(H0\)):
\[\bar{\theta}_R = \frac{1}{R}\sum_{r=1}^R \hat{\theta}_r\]
Standard Error \(\hat{SE}_{\theta_R}\) (under \(H0\)):
\[\hat{SE}_{\theta_R} = \sqrt{\frac{1}{R-1}\sum_{r=1}^R (\bar{\theta}_R - \hat{\theta}_r)^2}\]
Compute the p-value using the distribution (under \(H0\)):
One-sided p-value:
\[P(\theta \ge \hat{\theta} | H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \ge \hat{\theta})}{R}\]
Two-sided p-value:
\[P(\theta \le -|\hat{\theta}|~\text{or}~\theta \ge |\hat{\theta}| \mid H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \le -|\hat{\theta}|) + \mathbf{card}(\hat{\theta}_r \ge |\hat{\theta}|)}{R}\]
def randomize(x, estimator, n_perms=1000):
"""_summary_
Parameters
----------
x : array
the datasets
estimator : callable
        the estimator function, taking x as argument and returning the
        estimator value (scalar)
n_perms : int, optional
        the number of permutations, by default 1000
    Returns
    -------
Observed estimate
Mean of randomized estimates
Standard Error of randomized estimates
Two-sided p-value
Randomized distribution estimates (density_values, bins)
"""
# 1. Estimate the observed parameter or statistic
theta = estimator(x)
# 2. Permuted samples
# Randomly pick the sign with the function:
# np.random.choice([-1, 1], size=len(x), replace=True)
x_R = [np.random.choice([-1, 1], size=len(x), replace=True) * x
for boot_i in range(n_perms)]
# 3. Distribution of the parameter or statistic under H0
thetas_R = np.array([estimator(x_r) for x_r in x_R])
theta_density_R, bins = np.histogram(thetas_R, bins=50, density=True)
dx = np.diff(bins)
# 4. Randomized Statistics
# Mean of randomized estimates
theta_mean_R = np.mean(thetas_R)
# Standard Error of randomized estimates
theta_se_R = np.sqrt(1 / (n_perms - 1) *
np.sum((theta_mean_R - thetas_R) ** 2))
    # 5. Compute two-sided p-value using the distribution under H0:
    extreme_vals = (thetas_R <= -np.abs(theta)) | (thetas_R >= np.abs(theta))
    pval = np.sum(extreme_vals) / n_perms
# We could use:
# (np.sum(thetas_R <= -np.abs(theta)) + \
# np.sum(thetas_R >= np.abs(theta))) / n_perms
return theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins)
Example: we load the monthly revenue figures of 100 stores before and after a marketing campaign. We compute the difference (\(x_i = x_i^\text{after} - x_i^\text{before}\)) for each store \(i\). Under the null hypothesis, i.e., no effect of the campaign, \(x_i^\text{after}\) and \(x_i^\text{before}\) could be permuted, which is equivalent to randomly switching the sign of \(x_i\). Here we will focus on the sample mean \(\hat{\theta} = S(X) = 1/n\sum_i x_i\) as the statistic of interest.
df = pd.read_csv("../datasets/Monthly Revenue (in thousands).csv")
# Reshape and keep only the first 30 stores
df = df.pivot(index='store_id', columns='time', values='revenue')[:30]
df.after -= 3  # force a smaller effect size
x = df.after - df.before
plt.hist(x, fill=False)
plt.axvline(x=0, color="b", label=r'No effect: 0')
plt.axvline(x=x.mean(), color="r", ls='-', label=r'$\bar{x}=%.2f$' % x.mean())
plt.legend()
_ = plt.title(r'Distribution of the sales changes $x_i = x_i^\text{after} - \
x_i^\text{before}$')

np.random.seed(15) # set the seed for reproducible results
theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
randomize(x=x, estimator=np.mean, n_perms=10000)
print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 2.26, Mean(Estimate|H0): -0.0103, p-val: 2.430000e-02, SE: 1.018
Plot
pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals=bins[1:], stat_probs=theta_density_R,
stat_obs=theta, stat_h0=0, bar_width=np.diff(bins),
thresh_low=-np.abs(theta), thresh_high=np.abs(theta))
_ = plt.title(r'Randomized distribution of the estimator $\hat{\theta}_b$ (sample mean)')

A similar procedure can be conducted with many other statistics, e.g., the t-statistic (which yields the same p-value here):
def ttstat(x):
return (np.mean(x) - 0) / np.std(x, ddof=1) * np.sqrt(len(x))
np.random.seed(15) # set the seed for reproducible results
theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
randomize(x=x, estimator=ttstat, n_perms=10000)
print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 2.36, Mean(Estimate|H0): -0.0106, p-val: 2.430000e-02, SE: 1.025
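As a sanity check, the observed t-statistic can be compared with SciPy's parametric one-sample t-test, which relies on the Student distribution instead of permutations (reusing the x defined above):
t_obs, pval_param = scipy.stats.ttest_1samp(x, popmean=0)
print("t: {:.2f}, parametric p-value: {:.4f}".format(t_obs, pval_param))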
Or the median:
np.random.seed(15) # set the seed for reproducible results
theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
randomize(x=x, estimator=np.median, n_perms=10000)
print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
format(theta, theta_mean_R, pval, theta_se_R))
Estimate: 1.85, Mean(Estimate|H0): -0.0144, p-val: 9.370000e-02, SE: 1.066
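Recent SciPy versions (>= 1.7) also provide scipy.stats.permutation_test. A sketch, assuming (as documented by SciPy) that with a single sample and permutation_type='samples' the signs of the observations are flipped, i.e., the same scheme as the randomize function above:
res = scipy.stats.permutation_test((x,), np.mean, permutation_type='samples',
                                   n_resamples=10000, alternative='two-sided')
print("Statistic: {:.2f}, p-value: {:.4f}".format(res.statistic, res.pvalue))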
Bootstrapping¶
Resampling procedure to estimate the distribution of a statistic or parameter of interest, called the estimator.
Derive estimates of variability for the estimator (bias, standard error, confidence intervals, etc.).
Statistical inference is conducted by checking whether the confidence interval (CI) contains the value corresponding to the null hypothesis.
Nonparametric approach to statistical inference, useful when model assumptions are in doubt, unknown, or require complicated formulas.
Bootstrapping with replacement has favorable performance (Efron, 1979 and 1986) compared to prior methods such as the jackknife, which samples without replacement.
Regularize models by fitting several models on bootstrap samples and averaging their predictions (see Bagging and random forests in the machine learning chapter).
Note that, compared to random permutations, bootstrapping samples the distribution under the alternative hypothesis; it does not consider the distribution under the null hypothesis. A great advantage of the bootstrap is the simplicity of the procedure:

Bootstrapping procedure¶
Compute the estimator \(\hat{\theta} = S(X)\) on the initial dataset \(X\) of size \(N\).
Generate \(B\) samples (called bootstrap samples) \([X_1, \ldots X_b, \ldots X_B]\) from the initial dataset by randomly drawing with replacement \(N\) observations.
For each sample \(b\) compute the estimator \(\hat{\theta}_b = S(X_b)\), which provides the bootstrap distribution \(P_{\theta_B}\) of the estimator.
Compute statistics of the estimator on the bootstrap estimates \(\hat{\theta}_b\)'s:
Bootstrap estimate (of the parameter):
\[\bar{\theta}_B = \frac{1}{B}\sum_{b=1}^B \hat{\theta}_b\]
Bias = bootstrap estimate - estimate:
\[\hat{b}_{\theta_B} = \bar{\theta}_B - \hat{\theta}\]
Standard error \(\hat{S}_{\theta_B}\):
\[\hat{S}_{\theta_B} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B (\bar{\theta}_B - \hat{\theta}_b)^2}\]
Confidence interval using the estimated bootstrap distribution of the estimator:
\[CI_{95\%}=[\hat{\theta}_1=Q(2.5\%),~\hat{\theta}_2=Q(97.5\%)]\]
i.e., the 2.5% and 97.5% quantiles of the \(\hat{\theta}_b\)'s.
Application using the monthly revenue of 100 stores before and after a marketing campaign, using the difference (\(x_i = x_i^\text{after} - x_i^\text{before}\)) for each store \(i\). If the average difference \(\bar{x}=1/n \sum_i x_i\) is positive (resp. negative), then the marketing campaign will be considered effective (resp. detrimental). We will use bootstrapping to compute the confidence interval (CI) and check whether 0 lies within the CI.
x = df.after - df.before
S = np.mean
# 1. Model parameters
theta_hat = S(x)
np.random.seed(15) # set the seed for reproducible results
B = 1000  # number of bootstrap samples
# 2. Bootstrapped samples
x_B = [np.random.choice(x, size=len(x), replace=True) for boot_i in range(B)]
# 3. Bootstrap estimates and distribution
theta_hats_B = np.array([S(x_b) for x_b in x_B])
theta_density_B, bins = np.histogram(theta_hats_B, bins=50, density=True)
dx = np.diff(bins)
# 4. Bootstrap Statistics
# Bootstrap estimate
theta_bar_B = np.mean(theta_hats_B)
# Bias = bootstrap estimate - estimate
bias_hat_B = theta_bar_B - theta_hat
# Standard Error
se_hat_B = np.sqrt(1 / (B - 1) * np.sum((theta_bar_B - theta_hats_B) ** 2))
# Confidence interval using the estimated bootstrapped distribution of estimator
ci95 = np.quantile(a=theta_hats_B, q=[0.025, 0.975])
print(
"Est.: {:.2f}, Boot Est.: {:.2f}, bias: {:e},\
Boot SE: {:.2f}, CI: [{:.5f}, {:.5f}]"\
.format(theta_hat, theta_bar_B, bias_hat_B, se_hat_B, ci95[0], ci95[1]))
Est.: 2.26, Boot Est.: 2.19, bias: -7.201946e-02, Boot SE: 0.95, CI: [0.45256, 4.08465]
Conclusion: zero is outside the CI; moreover, \(\bar{x}\) is positive. Thus we can conclude that the marketing campaign produced a significant increase in sales.
Plot
plt.bar(bins[1:], theta_density_B, width=dx, fill=False, label=r'$P_{\theta_B}$')
plt.axvline(x=theta_hat, color='r', ls='--', lw=2, label=r'$\hat{\theta}$ (Obs. Stat.)')
plt.axvline(x=theta_bar_B, color="b", ls='-', label=r'$\bar{\theta}_B$')
plt.axvline(x=ci95[0], color="b", ls='--', label=r'$CI_{95\%}$')
plt.axvline(x=ci95[1], color="b", ls='--')
plt.axvline(x=0, color="k", lw=2, label=r'No effect: 0')
plt.legend()
_ = plt.title(r'Bootstrap distribution of the estimator $\hat{\theta}_b$ (sample mean)')

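For reference, SciPy (>= 1.7) also provides scipy.stats.bootstrap, which performs the resampling and computes the CI directly; a sketch using the percentile method to mirror the quantile-based interval above (reusing the x defined above):
res = scipy.stats.bootstrap((x,), np.mean, n_resamples=1000,
                            confidence_level=0.95, method='percentile')
print("CI: [{:.5f}, {:.5f}]".format(res.confidence_interval.low,
                                    res.confidence_interval.high))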