Resampling and Monte Carlo Methods
==================================

Sources:

-  `Scipy Resampling and Monte Carlo
   Methods <https://docs.scipy.org/doc/scipy/tutorial/stats/resampling.html>`__

.. code:: ipython3

    # Manipulate data
    import numpy as np
    import pandas as pd
    
    # Statistics
    import scipy.stats
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.stattools import jarque_bera
    
    # Plot
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pystatsml.plot_utils
    
    # Plot parameters
    plt.style.use('seaborn-v0_8-whitegrid')
    fig_w, fig_h = plt.rcParams.get('figure.figsize')
    plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
    %matplotlib inline

Monte Carlo Simulation of a Random Walk Process
-----------------------------------------------

One-dimensional random walk (Brownian motion)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

More information: `Random Walks, Central Limit
Theorem <https://www.youtube.com/watch?v=BUJCF900I0A>`__

At each step :math:`i` the process moves by +1 or -1 with equal
probability, i.e., :math:`X_i \in \{+1, -1\}` with
:math:`P(X_i=+1)=P(X_i=-1)=1/2`. The steps :math:`X_i` are i.i.d.

Let :math:`S_n = \sum_{i=1}^n X_i` be the position after :math:`n`
steps, defined recursively by :math:`S_i = S_{i-1} + X_i` with
:math:`S_0 = 0`.

Realizations of random walks are obtained by Monte Carlo simulation.
Since :math:`E(X_i)=0` and :math:`\text{Var}(X_i)=1`, :math:`S_n` has
mean :math:`0` and standard deviation :math:`\sqrt{n}`. Below we
simulate the trajectories, then plot a few of them, i.e., :math:`S_n`
for :math:`n=0` to :math:`200`.

.. code:: ipython3

    np.random.seed(seed=42)  # make the example reproducible

    n = 200        # trajectory length (number of steps)
    nsamp = 50000  # number of trajectories

    # Xn: each row (axis 0) is one trajectory; steps lie along axis 1
    Xn = np.array([np.random.choice(a=np.array([-1, +1]), size=n,
                                    replace=True, p=np.ones(2)/2)
                   for i in range(nsamp)])

    # Final position of each trajectory: sum of its steps
    Sn = Xn.sum(axis=1)

    print("True Stat. Mean={:.03f}, Sd={:.02f}".\
        format(0, np.sqrt(n) * 1))

    print("Est. Stat. Mean={:.03f}, Sd={:.02f}".\
        format(Sn.mean(), Sn.std()))
    



.. parsed-literal::

    True Stat. Mean=0.000, Sd=14.14
    Est. Stat. Mean=0.010, Sd=14.09
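
The Python loop over trajectories is kept for clarity, but since all
steps are i.i.d., the same simulation can be vectorized in a single
call (a minimal sketch; the numbers differ slightly because the random
stream is consumed differently):

.. code:: ipython3

    # Vectorized alternative: draw the whole (nsamp, n) matrix of +/-1 steps at once
    Xn_vec = np.random.choice(a=np.array([-1, +1]), size=(nsamp, n),
                              replace=True, p=np.ones(2) / 2)
    Sn_vec = Xn_vec.sum(axis=1)
    print("Vectorized Est. Mean={:.03f}, Sd={:.02f}".format(Sn_vec.mean(),
                                                            Sn_vec.std()))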


Plot cumulative sum of 100 random walks (trajectories)

.. code:: ipython3

    Sn_traj = Xn[:100, :].cumsum(axis=1)
    _ = pd.DataFrame(Sn_traj.T).plot(legend=False)



.. image:: stat_montecarlo_files/stat_montecarlo_5_0.png


Distribution of :math:`S_n` vs :math:`\mathcal{N}(0, \sqrt{n})`

.. code:: ipython3

    x_low, x_high = Sn.mean() - 3 * Sn.std(), Sn.mean() + 3 * Sn.std()
    _ = plt.hist(Sn, range=(x_low, x_high), density=True, bins=43, fill=False,
                 label="Histogram")

    x_range = np.linspace(x_low, x_high, 30)
    prob_x_range = scipy.stats.norm.pdf(x_range, loc=Sn.mean(), scale=Sn.std())
    plt.plot(x_range, prob_x_range, 'r-', label="PDF: P(X)")
    _ = plt.legend()



.. image:: stat_montecarlo_files/stat_montecarlo_7_0.png
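
A quantile-quantile plot gives a sharper visual check of the normal
approximation than the histogram. A minimal sketch using
``scipy.stats.probplot``, comparing :math:`S_n` to
:math:`\mathcal{N}(0, \sqrt{n})` (``sparams`` fixes the location and
scale of the theoretical distribution):

.. code:: ipython3

    # QQ-plot of Sn against the theoretical N(0, sqrt(n)) distribution:
    # points close to the straight line support the normal approximation
    _ = scipy.stats.probplot(Sn, dist="norm", sparams=(0, np.sqrt(n)), plot=plt)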


Permutation Tests
-----------------

`Permutation test <https://en.wikipedia.org/wiki/Permutation_test>`__:

-  The test involves two or more samples assuming that values can be
   **randomly permuted** under the **null hypothesis**.
-  The test is a **resampling procedure to estimate the distribution of
   a parameter or any statistic under the null hypothesis**, i.e.,
   calculated on the permuted data. This parameter or statistic is
   called the **estimator**.
-  **Statistical inference** is conducted by computing the proportion of
   permuted values of the estimator that are “more extreme” than the
   observed one, providing an estimate of the *p-value*.
-  Permutation tests are a subset of non-parametric statistics, useful
   when the distribution of the estimator (under H0) is unknown or
   requires complicated formulas.

Permutation test procedure:

.. figure:: images/stat_random_permutations.png
   :alt: Permutation test procedure

   Permutation test procedure

1. Estimate the observed parameter or statistic
   :math:`\hat{\theta} = S(X)` on the initial dataset :math:`X` of size
   :math:`N`. We call it the observed statistic.
2. Generate :math:`R` samples (called randomized samples)
   :math:`[X_1, \ldots X_r, \ldots X_R]` from the initial dataset by
   permutation of the values between the two samples.
3. Distribution of the estimator under H0: for each random sample
   :math:`r`, compute the estimator :math:`\hat{\theta}_r = S(X_r)`. The
   set of :math:`\{\hat{\theta}_r\}` provides an estimate of the
   distribution :math:`P(\theta | H0)` (under the null hypothesis).
4. Compute statistics of the estimator under the null hypothesis using
   the randomized estimates :math:`\hat{\theta}_r`\ ’s:

-  Mean (under :math:`H0`):

   .. math::


      \bar{\theta}_R = \frac{1}{R}\sum_{r=1}^R \hat{\theta}_r

-  Standard Error :math:`\hat{SE}_{\theta_R}` (under :math:`H0`):

   .. math::


      \hat{SE}_{\theta_R} = \sqrt{\frac{1}{R-1}\sum_{r=1}^R (\bar{\theta}_R - \hat{\theta}_r)^2}

-  Compute p-value using the distribution (under :math:`H0`):

   -  One-sided p-value:

      .. math::


         P(\theta \ge \hat{\theta} | H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \ge \hat{\theta})}{R}

   -  Two-sided p-value:

      .. math::


         P(\theta \le -|\hat{\theta}|~\text{or}~\theta \ge |\hat{\theta}| ~|~ H0) \approx \frac{\mathbf{card}(\hat{\theta}_r \le -|\hat{\theta}|) + \mathbf{card}(\hat{\theta}_r \ge |\hat{\theta}|)}{R}

.. code:: ipython3

    def randomize(x, estimator, n_perms=1000):
        """One-sample randomization (sign-flipping) test.

        Parameters
        ----------
        x : array
            the dataset
        estimator : callable
            the estimator function taking x as argument and returning the
            estimator value (scalar)
        n_perms : int, optional
            the number of permutations, by default 1000

        Returns
        -------
        Observed estimate
        Mean of randomized estimates
        Standard error of randomized estimates
        Two-sided p-value
        Randomized distribution of the estimates (density_values, bins)
        """

        # 1. Estimate the observed parameter or statistic
        theta = estimator(x)

        # 2. Randomized samples: under H0 the sign of each observation is
        # arbitrary, so randomly flip it with
        # np.random.choice([-1, 1], size=len(x), replace=True)
        x_R = [np.random.choice([-1, 1], size=len(x), replace=True) * x
               for perm_i in range(n_perms)]

        # 3. Distribution of the parameter or statistic under H0
        thetas_R = np.array([estimator(x_r) for x_r in x_R])
        theta_density_R, bins = np.histogram(thetas_R, bins=50, density=True)

        # 4. Statistics of the estimator under the null hypothesis

        # Mean of randomized estimates
        theta_mean_R = np.mean(thetas_R)

        # Standard error of randomized estimates
        theta_se_R = np.sqrt(1 / (n_perms - 1) *
                             np.sum((theta_mean_R - thetas_R) ** 2))

        # Two-sided p-value: proportion of randomized estimates at least
        # as extreme as the observed one
        extreme_vals = (thetas_R <= -np.abs(theta)) | (thetas_R >= np.abs(theta))
        pval = np.sum(extreme_vals) / n_perms

        return theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins)

Example: we load the monthly revenue figures of 100 stores before and
after a marketing campaign (only the first 30 stores are kept below).
We compute the difference
(:math:`x_i = x_i^\text{after} - x_i^\text{before}`) for each store
:math:`i`. Under the null hypothesis, i.e., no effect of the campaign,
:math:`x_i^\text{after}` and :math:`x_i^\text{before}` could be
permuted, which is equivalent to randomly switching the sign of
:math:`x_i`. Here we focus on the sample mean
:math:`\hat{\theta} = S(X) = 1/n\sum_i x_i` as the statistic of
interest.

.. code:: ipython3

    df = pd.read_csv("../datasets/Monthly Revenue (in thousands).csv")
    # Reshape and keep only the first 30 stores
    df = df.pivot(index='store_id', columns='time', values='revenue')[:30]
    df.after -= 3  # force a smaller effect size

.. code:: ipython3

    x = df.after - df.before
    
    plt.hist(x, fill=False)
    plt.axvline(x=0, color="b", label=r'No effect: 0')
    plt.axvline(x=x.mean(), color="r", ls='-', label=r'$\bar{x}=%.2f$' % x.mean())
    plt.legend()
    _ = plt.title(r'Distribution of the sales changes $x_i = x_i^\text{after} - \
        x_i^\text{before}$')



.. image:: stat_montecarlo_files/stat_montecarlo_12_0.png


.. code:: ipython3

    np.random.seed(15) # set the seed for reproducible results
    
    theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
        randomize(x=x, estimator=np.mean, n_perms=10000)
    
    print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
        format(theta, theta_mean_R, pval, theta_se_R))


.. parsed-literal::

    Estimate: 2.26, Mean(Estimate|H0): -0.0103, p-val: 2.430000e-02, SE: 1.018


Plot

.. code:: ipython3

    pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals=bins[1:], stat_probs=theta_density_R,
                         stat_obs=theta,  stat_h0=0, bar_width=np.diff(bins),
                         thresh_low=-np.abs(theta), thresh_high=np.abs(theta))
    _ = plt.title(r'Randomized distribution of the estimator $\hat{\theta}_b$ (sample mean)')



.. image:: stat_montecarlo_files/stat_montecarlo_15_0.png


A similar procedure can be conducted with any statistic, e.g., the
t-statistic (same p-value):

.. code:: ipython3

    def ttstat(x):
        """One-sample t-statistic of x against 0."""
        return (np.mean(x) - 0) / np.std(x, ddof=1) * np.sqrt(len(x))
    
    
    np.random.seed(15) # set the seed for reproducible results
    theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
        randomize(x=x, estimator=ttstat, n_perms=10000)
    
    print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
        format(theta, theta_mean_R, pval, theta_se_R))


.. parsed-literal::

    Estimate: 2.36, Mean(Estimate|H0): -0.0106, p-val: 2.430000e-02, SE: 1.025
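
As a sanity check, the observed t-statistic and a parametric p-value
can be obtained with ``scipy.stats.ttest_1samp``; with only 30 stores,
if the normality assumption roughly holds, the parametric p-value is
expected to be close to the permutation one (a minimal sketch):

.. code:: ipython3

    # Parametric one-sample t-test of x against 0, for comparison
    ttest = scipy.stats.ttest_1samp(x, popmean=0)
    print("t={:.2f}, parametric p-val={:.4f}".format(ttest.statistic,
                                                     ttest.pvalue))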


Or the median:

.. code:: ipython3

    np.random.seed(15) # set the seed for reproducible results
    theta, theta_mean_R, theta_se_R, pval, (theta_density_R, bins) = \
        randomize(x=x, estimator=np.median, n_perms=10000)
    
    print("Estimate: {:.2f}, Mean(Estimate|HO): {:.4f}, p-val: {:e}, SE: {:.3f}".\
        format(theta, theta_mean_R, pval, theta_se_R))


.. parsed-literal::

    Estimate: 1.85, Mean(Estimate|H0): -0.0144, p-val: 9.370000e-02, SE: 1.066
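
Note that SciPy (>= 1.7) provides ``scipy.stats.permutation_test``,
which implements the same idea; with a single sample and
``permutation_type='samples'``, it randomly flips the signs of the
observations, as done manually above (a minimal sketch; p-values should
agree up to Monte Carlo error):

.. code:: ipython3

    # Sign-flipping permutation test of the sample mean with SciPy
    res = scipy.stats.permutation_test((x,), np.mean,
                                       permutation_type='samples',
                                       n_resamples=10000,
                                       alternative='two-sided',
                                       random_state=15)
    print("p-val: {:.4f}".format(res.pvalue))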




Bootstrapping
-------------

`Bootstrapping <https://en.wikipedia.org/wiki/Bootstrapping_(statistics)>`__:

-  **Resampling procedure** to estimate the distribution of a statistic
   or parameter of interest, called the estimator.
-  Derives estimates of the **variability of the estimator** (bias,
   standard error, confidence intervals, etc.).
-  **Statistical inference** is conducted by checking whether the
   `confidence
   interval <https://www.researchgate.net/publication/296473825_Bootstrap_Confidence_Intervals_for_Noise_Indicators>`__
   **(CI) contains the null hypothesis value**.
-  A nonparametric approach to statistical inference, useful when model
   assumptions are in doubt, unknown, or require complicated formulas.
-  **Bootstrapping with replacement** has favorable performance (`Efron
   1979 <https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-1/Bootstrap-Methods-Another-Look-at-the-Jackknife/10.1214/aos/1176344552.full>`__,
   and
   `1986 <https://projecteuclid.org/download/pdf_1/euclid.ss/1177013815>`__)
   compared to prior methods like the jackknife, which samples without
   replacement.
-  **Regularizes models** by fitting several models on bootstrap samples
   and averaging their predictions (see bagging and random forests in
   the machine learning chapter).

Note that, compared to random permutations, bootstrapping samples the
distribution under the **alternative hypothesis**; it does not consider
the distribution under the null hypothesis. A great advantage of the
bootstrap is the **simplicity of the procedure**:

.. figure:: images/stat_bootstrapping.png
   :alt: Bootstrapping procedure

   Bootstrapping procedure

1. Compute the estimator :math:`\hat{\theta} = S(X)` on the initial
   dataset :math:`X` of size :math:`N`.
2. Generate :math:`B` samples (called bootstrap samples)
   :math:`[X_1, \ldots X_b, \ldots X_B]` from the initial dataset by
   randomly drawing **with replacement** :math:`N` observations.
3. For each sample :math:`b` compute the estimator
   :math:`\hat{\theta}_b = S(X_b)`, which provides the bootstrap
   distribution :math:`P_{\theta_B}` of the estimator.
4. Compute statistics of the estimator on bootstrap estimates
   :math:`\hat{\theta}_b`\ ’s:

-  Bootstrap estimate (of the parameters):

   .. math::


      \bar{\theta}_B = \frac{1}{B}\sum_{b=1}^B \hat{\theta}_b

-  Bias = bootstrap estimate - estimate:

   .. math::


      \hat{b}_{\theta_B} = \bar{\theta}_B - \hat{\theta}

-  Standard error :math:`\hat{S}_{\theta_B}`:

   .. math::


      \hat{S}_{\theta_B} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B (\bar{\theta}_B - \hat{\theta}_b)^2}

-  Confidence interval using the estimated bootstrapped distribution of
   the estimator:

   .. math::


      CI_{95\%}=[\hat{\theta}_1=Q(2.5\%),~\hat{\theta}_2=Q(97.5\%)],~\text{i.e., the}~2.5\%~\text{and}~97.5\%~\text{quantiles of the}~\hat{\theta}_b\text{'s}

Application: using the monthly revenue of 100 stores before and after a
marketing campaign, we compute the difference
(:math:`x_i = x_i^\text{after} - x_i^\text{before}`) for each store
:math:`i`. If the average difference :math:`\bar{x}=1/n \sum_i x_i` is
positive (resp. negative), then the marketing campaign will be
considered efficient (resp. detrimental). We use bootstrapping to
compute the confidence interval (CI) and check whether 0 lies within
the CI.

.. code:: ipython3

    x = df.after - df.before
    S = np.mean
    
    # 1. Compute the estimator on the initial dataset
    theta_hat = S(x)

    np.random.seed(15)  # set the seed for reproducible results
    B = 1000  # number of bootstrap samples

    # 2. Bootstrapped samples: draw N observations with replacement, B times

    x_B = [np.random.choice(x, size=len(x), replace=True) for boot_i in range(B)]
    
    # 3. Bootstrap estimates and distribution
    
    theta_hats_B = np.array([S(x_b) for x_b in x_B])
    theta_density_B, bins = np.histogram(theta_hats_B, bins=50, density=True)
    dx = np.diff(bins)
    
    # 4. Bootstrap Statistics
    
    # Bootstrap estimate
    theta_bar_B = np.mean(theta_hats_B)
    
    # Bias = bootstrap estimate - estimate
    bias_hat_B = theta_bar_B - theta_hat
    
    # Standard Error
    se_hat_B = np.sqrt(1 / (B - 1) * np.sum((theta_bar_B - theta_hats_B) ** 2))
    
    # Confidence interval using the estimated bootstrapped distribution of estimator
    
    ci95 = np.quantile(a=theta_hats_B, q=[0.025, 0.975])
    
    print(
        "Est.: {:.2f}, Boot Est.: {:.2f}, bias: {:e},\
         Boot SE: {:.2f}, CI: [{:.5f}, {:.5f}]"\
        .format(theta_hat, theta_bar_B, bias_hat_B, se_hat_B, ci95[0], ci95[1]))


.. parsed-literal::

    Est.: 2.26, Boot Est.: 2.19, bias: -7.201946e-02,     Boot SE: 0.95, CI: [0.45256, 4.08465]


Conclusion: zero lies outside the CI and, moreover, :math:`\bar{x}` is
positive. Thus we can conclude that the marketing campaign produced a
significant increase in sales.
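
Note that SciPy (>= 1.7) also provides ``scipy.stats.bootstrap``, which
returns the CI and standard error directly (a minimal sketch using the
percentile method; results should be close to the manual procedure
above, up to Monte Carlo error):

.. code:: ipython3

    # Percentile bootstrap CI of the sample mean with SciPy
    res = scipy.stats.bootstrap((x,), np.mean, n_resamples=1000,
                                confidence_level=0.95, method='percentile',
                                random_state=15)
    print("CI: [{:.5f}, {:.5f}], Boot SE: {:.2f}".format(
        res.confidence_interval.low, res.confidence_interval.high,
        res.standard_error))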

Plot

.. code:: ipython3

    plt.bar(bins[1:], theta_density_B, width=dx, fill=False, label=r'$P_{\theta_B}$')
    plt.axvline(x=theta_hat, color='r', ls='--', lw=2, label=r'$\hat{\theta}$ (Obs. Stat.)')
    plt.axvline(x=theta_bar_B, color="b", ls='-', label=r'$\bar{\theta}_B$')
    plt.axvline(x=ci95[0], color="b", ls='--', label=r'$CI_{95\%}$')
    plt.axvline(x=ci95[1], color="b", ls='--')
    plt.axvline(x=0, color="k", lw=2, label=r'No effect: 0')
    plt.legend()
    _ = plt.title(r'Bootstrap distribution of the estimator $\hat{\theta}_b$ (sample mean)')




.. image:: stat_montecarlo_files/stat_montecarlo_25_0.png