.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/ml_resampling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_ml_resampling.py:

Resampling methods
==================

.. GENERATED FROM PYTHON SOURCE LINES 5-20

.. code-block:: default

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    from sklearn import datasets
    import sklearn.linear_model as lm
    from sklearn.model_selection import train_test_split, KFold, PredefinedSplit
    from sklearn.model_selection import cross_val_score, GridSearchCV
    import sklearn.metrics as metrics

    X, y = datasets.make_regression(n_samples=100, n_features=100,
                                    n_informative=10, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 21-63

Train, validation and test sets
-------------------------------

Machine learning algorithms overfit training data. Predictive performance
**MUST** be evaluated on an independent hold-out dataset.

.. figure:: ../images/train_val_test_cv.png
   :alt: Train, validation and test sets.

1. **Training dataset**: Dataset used to fit the model (set the model parameters,
   like weights). The *training error* can be easily calculated by applying the
   statistical learning method to the observations used in its training. But
   because of overfitting, the **training error rate can dramatically
   underestimate the error** that would be obtained on new samples.

2. **Validation dataset**: Dataset used to provide an unbiased evaluation of a
   model fitted on the training dataset while **tuning model hyperparameters**,
   i.e. **model selection**. The validation error is the average error that
   results from using the learning method to predict the response on new
   (validation) samples, that is, samples that were not used to train the method.

3. **Test dataset**: Dataset used to provide an unbiased **evaluation of the
   final model** fitted on the training dataset. It is only used once the model is
   completely trained (using the train and validation sets).

What is the Difference Between Test and Validation Datasets? by
[Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)

Thus the original dataset is generally split into training, validation and test
datasets. A large training + validation set (80%) with a small test set (20%) may
provide a poor estimate of the predictive performance (the same argument holds
for the train vs. validation split). Conversely, a large test set with a small
training set may produce a poorly estimated learner. This is why, in situations
where we cannot afford such a split, cross-validation can be used for model
selection and/or model evaluation.

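As an illustration, here is a minimal sketch of such a three-way split, using two
successive `train_test_split` calls and an arbitrary 60%/20%/20% partition of the
`X`, `y` regression data generated above (the proportions are a choice made for
this sketch, not a recommendation):

.. code-block:: default

    # Illustrative 60/20/20 split: first carve out the test set, then split the
    # remainder into train and validation sets.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                      random_state=42)
    print("Train: %i, Validation: %i, Test: %i" %
          (X_train.shape[0], X_val.shape[0], X_test.shape[0]))
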
If the sample size is limited, a train/validation/test split may not be possible.
**Cross-Validation (CV)** (see below) can be used to replace:

- the outer (train/test) split of model evaluation;
- the inner train/validation split of model selection (the more frequent situation);
- both the inner and outer splits, leading to two nested CVs.

.. GENERATED FROM PYTHON SOURCE LINES 66-69

Split dataset in train/test sets for model evaluation
-----------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 69-81

.. code-block:: default

    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

    mod = lm.Ridge(alpha=10)

    mod.fit(X_train, y_train)

    y_pred_test = mod.predict(X_test)
    print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test R2: 0.74

.. GENERATED FROM PYTHON SOURCE LINES 82-101

Train/validation/test splits: model selection and model evaluation
-------------------------------------------------------------------

The **grid search procedure** (`GridSearchCV`) performs model selection of the
best **hyper-parameters** :math:`\alpha` over a grid of possible values. The
train set is split (inner split) into train/validation sets.

**Model selection with the grid search procedure:**

1. Fit the learner (i.e. estimate the **parameters** :math:`\mathbf{\Omega}_k`) on
   the training set:
   :math:`\mathbf{X}_{train}, \mathbf{y}_{train} \rightarrow f_{\alpha_k, \mathbf{\Omega}_k}(.)`

2. Evaluate the model on the validation set and keep the hyper-parameter(s) that
   minimise the error measure:
   :math:`\alpha_* = \arg \min L(f_{\alpha_k, \mathbf{\Omega}_k}(\mathbf{X}_{val}), \mathbf{y}_{val})`

3. Refit the learner on all training + validation data,
   :math:`\mathbf{X}_{train \cup val}, \mathbf{y}_{train \cup val}`, using the best
   hyper-parameters (:math:`\alpha_*`):
   :math:`\rightarrow f_{\alpha_*, \mathbf{\Omega}_*}(.)`

**Model evaluation:** on the test set:
:math:`L(f_{\alpha_*, \mathbf{\Omega}_*}(\mathbf{X}_{test}), \mathbf{y}_{test})`

.. GENERATED FROM PYTHON SOURCE LINES 101-122

.. code-block:: default

    train_idx, validation_idx = train_test_split(np.arange(X_train.shape[0]),
                                                 test_size=0.25, shuffle=True,
                                                 random_state=42)

    # Build the test_fold array expected by PredefinedSplit:
    # -1 for training samples, 0 for validation samples.
    validation_fold = np.full(X_train.shape[0], -1)
    validation_fold[validation_idx] = 0
    split_inner = PredefinedSplit(test_fold=validation_fold)

    print("Train set size: %i" % X_train[train_idx].shape[0])
    print("Validation set size: %i" % X_train[validation_idx].shape[0])
    print("Test set size: %i" % X_test.shape[0])

    lm_cv = GridSearchCV(lm.Ridge(), {'alpha': 10. ** np.arange(-3, 3)},
                         cv=split_inner, n_jobs=5)

    # Fit, including model selection with internal train/validation split
    lm_cv.fit(X_train, y_train)

    # Predict
    y_pred_test = lm_cv.predict(X_test)
    print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Train set size: 56
    Validation set size: 19
    Test set size: 25
    Test R2: 0.80

.. GENERATED FROM PYTHON SOURCE LINES 123-179

Cross-Validation (CV)
---------------------

If the sample size is limited, a train/validation/test split may not be possible.
**Cross-Validation (CV)** can be used to replace the train/validation split
and/or the train+validation / test split.

Cross-validation randomly divides the set of observations into *K* groups, or
**folds**, of approximately equal size. Each fold in turn is treated as a
validation set, and the method :math:`f()` is fitted on the remaining union of
*K - 1* folds: (:math:`f(\boldsymbol{X}_{-K}, \boldsymbol{y}_{-K})`).

The measure of performance (the score function :math:`\mathcal{S}`), either an
error measure or a correct-prediction measure, is an average of a loss or
correct-prediction measure, noted :math:`\mathcal{L}`, between the true and the
predicted target values. The score function is evaluated on the observations of
the held-out fold. For each sample *i* we consider the model
:math:`f(\boldsymbol{X}_{-k(i)}, \boldsymbol{y}_{-k(i)})` estimated on the data
without the group *k* that contains *i*, noted *-k(i)*.

This procedure is repeated *K* times; each time, a different group of
observations is treated as the held-out set. We then compare the predicted value
(:math:`f_{-k(i)}(\boldsymbol{x}_i) = \hat{y_i}`) with the true value
:math:`y_i` using an error or loss function :math:`\mathcal{L}(y, \hat{y})`.

For a 10-fold CV we can either average over the 10 fold-level values (macro
measure) or concatenate the 10 experiments and compute the micro measure. Two
strategies
[micro vs macro estimates](https://stats.stackexchange.com/questions/34611/meanscores-vs-scoreconcatenation-in-cross-validation):

- **Micro measure: average(individual scores)**: compute a score
  :math:`\mathcal{S}` for each sample and average over all samples. It is
  similar to **average score(concatenation)**: an averaged score computed over
  all concatenated samples.

  .. raw:: latex

     \mathcal{S}(f) = \frac{1}{N} \sum_i^N \mathcal{L}\left(y_i, f(\boldsymbol{x}_{-k(i)}, \boldsymbol{y}_{-k(i)}) \right).

- **Macro measure: mean(CV scores)** (the most commonly used method): compute a
  score :math:`\mathcal{S}` on each fold *k* and average across folds:

  .. raw:: latex

     \begin{align*}
        \mathcal{S}(f) &= \frac{1}{K} \sum_k^K \mathcal{S}_k(f).\\
        \mathcal{S}(f) &= \frac{1}{K} \sum_k^K \frac{1}{N_k} \sum_{i \in k} \mathcal{L}\left(y_i, f(\boldsymbol{x}_{-k(i)}, \boldsymbol{y}_{-k(i)}) \right).
     \end{align*}

These two measures (an average of averages vs. a global average) are generally
similar (a small numerical illustration is given below). They may differ
slightly when folds are of different sizes.

This validation scheme is known as **K-Fold CV**. Typical choices of *K* are
5 or 10 [Kohavi 1995]. The extreme case where *K = N* is known as
**leave-one-out cross-validation, LOO-CV**.

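To make the distinction concrete, here is a minimal sketch (assuming the
regression data `X`, `y` generated at the beginning of this example and an
arbitrary 5-fold split) that computes the macro measure, the mean of the
per-fold R2, and the micro measure, the R2 of the concatenated out-of-fold
predictions:

.. code-block:: default

    # Macro vs. micro CV measures with a 5-fold CV (illustrative sketch).
    estimator = lm.Ridge(alpha=10)
    cv = KFold(n_splits=5, shuffle=True, random_state=42)

    fold_scores = []               # per-fold scores (macro)
    y_pred_oof = np.zeros(len(y))  # out-of-fold predictions (micro)

    for train, test in cv.split(X):
        estimator.fit(X[train, :], y[train])
        y_pred_oof[test] = estimator.predict(X[test, :])
        fold_scores.append(metrics.r2_score(y[test], y_pred_oof[test]))

    print("Macro R2 (mean of fold scores): %.2f" % np.mean(fold_scores))
    print("Micro R2 (concatenated folds):  %.2f" % metrics.r2_score(y, y_pred_oof))
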
.. GENERATED FROM PYTHON SOURCE LINES 181-188

CV for regression
~~~~~~~~~~~~~~~~~

Usually the score function is the r-squared. However, other metrics
(MAE, MSE) can be used.

**CV with explicit loop:**

.. GENERATED FROM PYTHON SOURCE LINES 188-204

.. code-block:: default

    from sklearn.model_selection import KFold

    estimator = lm.Ridge(alpha=10)

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    r2_train, r2_test = list(), list()

    for train, test in cv.split(X):
        estimator.fit(X[train, :], y[train])
        r2_train.append(metrics.r2_score(y[train], estimator.predict(X[train, :])))
        r2_test.append(metrics.r2_score(y[test], estimator.predict(X[test, :])))

    print("Train r2:%.2f" % np.mean(r2_train))
    print("Test r2:%.2f" % np.mean(r2_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Train r2:0.99
    Test r2:0.67

.. GENERATED FROM PYTHON SOURCE LINES 205-208

Scikit-learn provides user-friendly functions to perform CV.

`cross_val_score()`: single metric

.. GENERATED FROM PYTHON SOURCE LINES 208-219

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(estimator=estimator, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(estimator=estimator, X=X, y=y, cv=cv)
    print("Test r2:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test r2:0.73
    Test r2:0.67

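`cross_val_score` uses the estimator's default score (the r-squared for
regressors). A different metric can be requested through the `scoring` argument;
a minimal sketch, reusing the `estimator` and `cv` defined above:

.. code-block:: default

    # Request the mean absolute error instead of the default r-squared.
    # sklearn uses "neg_" scorers so that greater is always better.
    scores_mae = cross_val_score(estimator=estimator, X=X, y=y, cv=cv,
                                 scoring='neg_mean_absolute_error')
    print("Test MAE:%.2f" % -scores_mae.mean())
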
.. GENERATED FROM PYTHON SOURCE LINES 220-221

`cross_validate()`: multi metric, + time, etc.

.. GENERATED FROM PYTHON SOURCE LINES 221-231

.. code-block:: default

    from sklearn.model_selection import cross_validate

    scores = cross_validate(estimator=mod, X=X, y=y, cv=cv,
                            scoring=['r2', 'neg_mean_absolute_error'])

    print("Test R2:%.2f; MAE:%.2f" % (scores['test_r2'].mean(),
                                      -scores['test_neg_mean_absolute_error'].mean()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test R2:0.67; MAE:55.27

.. GENERATED FROM PYTHON SOURCE LINES 232-244

CV for classification: stratify for the target label
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With classification problems it is essential to sample folds where each fold
contains approximately the same percentage of samples of each target class as
the complete set. This is called **stratification**. In this case, we will use
``StratifiedKFold``, which is a variation of k-fold that returns stratified
folds.

Usually the score functions are, at least, the sensitivity and the specificity.
However, other functions can be used.

**CV with explicit loop**:

.. GENERATED FROM PYTHON SOURCE LINES 244-265

.. code-block:: default

    from sklearn.model_selection import StratifiedKFold

    X, y = datasets.make_classification(n_samples=100, n_features=100, shuffle=True,
                                        n_informative=10, random_state=42)

    mod = lm.LogisticRegression(C=1, solver='lbfgs')

    cv = StratifiedKFold(n_splits=5)

    # Lists to store scores by folds (for macro measure only)
    auc, bacc = [], []

    for train, test in cv.split(X, y):
        mod.fit(X[train, :], y[train])
        auc.append(metrics.roc_auc_score(y[test], mod.decision_function(X[test, :])))
        bacc.append(metrics.balanced_accuracy_score(y[test], mod.predict(X[test, :])))

    print("Test AUC:%.2f; bACC:%.2f" % (np.mean(auc), np.mean(bacc)))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.86; bACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 266-267

`cross_val_score()`: single metric

.. GENERATED FROM PYTHON SOURCE LINES 267-273

.. code-block:: default

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=5)
    print("Test ACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test ACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 274-275

Provide your own CV and score function:

.. GENERATED FROM PYTHON SOURCE LINES 275-284

.. code-block:: default

    def balanced_acc(estimator, X, y, **kwargs):
        """Balanced accuracy scorer."""
        return metrics.recall_score(y, estimator.predict(X), average=None).mean()

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=cv,
                             scoring=balanced_acc)
    print("Test bACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test bACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 285-286

`cross_validate()`: multi metric, + time, etc.

.. GENERATED FROM PYTHON SOURCE LINES 286-297

.. code-block:: default

    from sklearn.model_selection import cross_validate

    scores = cross_validate(estimator=mod, X=X, y=y, cv=cv,
                            scoring=['balanced_accuracy', 'roc_auc'])

    print("Test AUC:%.2f; bACC:%.2f" % (scores['test_roc_auc'].mean(),
                                        scores['test_balanced_accuracy'].mean()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.86; bACC:0.80

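Instead of writing the scorer by hand as above, a plain metric function can also
be wrapped with `sklearn.metrics.make_scorer`. A minimal sketch, reusing the
`mod` and `cv` defined above:

.. code-block:: default

    from sklearn.metrics import make_scorer

    # Wrap balanced_accuracy_score into a scorer usable by cross_val_score
    # and GridSearchCV.
    bacc_scorer = make_scorer(metrics.balanced_accuracy_score)

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=cv, scoring=bacc_scorer)
    print("Test bACC:%.2f" % scores.mean())
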
.. GENERATED FROM PYTHON SOURCE LINES 298-304

Cross-validation for model selection
------------------------------------

Combine CV and grid search: re-split (inner split) the train set into CV
train/validation folds and build a `GridSearchCV` out of it:

.. GENERATED FROM PYTHON SOURCE LINES 304-323

.. code-block:: default

    # Outer split:
    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Cross-validation for model selection
    lm_cv = GridSearchCV(lm.LogisticRegression(), {'C': 10. ** np.arange(-3, 3)},
                         cv=cv_inner, n_jobs=5)

    # Fit, including model selection with internal CV
    lm_cv.fit(X_train, y_train)

    # Predict
    y_pred_test = lm_cv.predict(X_test)
    print("Test bACC: %.2f" % metrics.balanced_accuracy_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test bACC: 0.63

.. GENERATED FROM PYTHON SOURCE LINES 324-326

Cross-validation for both model (outer) evaluation and model (inner) selection
-------------------------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 326-342

.. code-block:: default

    cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Cross-validation for model (inner) selection
    lm_cv = GridSearchCV(lm.LogisticRegression(), {'C': 10. ** np.arange(-3, 3)},
                         cv=cv_inner, n_jobs=5)

    # Cross-validation for model (outer) evaluation
    scores = cross_validate(estimator=lm_cv, X=X, y=y, cv=cv_outer,
                            scoring=['balanced_accuracy', 'roc_auc'])

    print("Test AUC:%.2f; bACC:%.2f, Time: %.2fs" % (scores['test_roc_auc'].mean(),
                                                     scores['test_balanced_accuracy'].mean(),
                                                     scores['fit_time'].sum()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.85; bACC:0.74, Time: 0.03s

.. GENERATED FROM PYTHON SOURCE LINES 343-349

Models with built-in cross-validation
--------------------------------------

Let sklearn select the best hyper-parameters over a default grid.

**Classification**

.. GENERATED FROM PYTHON SOURCE LINES 349-356

.. code-block:: default

    print("== Logistic Ridge (L2 penalty) ==")
    mod_cv = lm.LogisticRegressionCV(class_weight='balanced',
                                     scoring='balanced_accuracy', n_jobs=-1, cv=5)
    scores = cross_val_score(estimator=mod_cv, X=X, y=y, cv=5)
    print("Test ACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    == Logistic Ridge (L2 penalty) ==
    Test ACC:0.78

.. GENERATED FROM PYTHON SOURCE LINES 357-358

**Regression**

.. GENERATED FROM PYTHON SOURCE LINES 358-378

.. code-block:: default

    X, y, coef = datasets.make_regression(n_samples=50, n_features=100, noise=10,
                                          n_informative=2, random_state=42, coef=True)

    print("== Ridge (L2 penalty) ==")
    model = lm.RidgeCV(cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    print("== Lasso (L1 penalty) ==")
    model = lm.LassoCV(n_jobs=-1, cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    print("== ElasticNet (L1+L2 penalty) ==")
    model = lm.ElasticNetCV(l1_ratio=[.1, .5, .9], n_jobs=-1, cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    == Ridge (L2 penalty) ==
    Test r2:0.16
    == Lasso (L1 penalty) ==
    Test r2:0.74
    == ElasticNet (L1+L2 penalty) ==
    Test r2:0.58

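To inspect which regularization strength such built-in CV models actually
selected, they can be fitted and their `alpha_` attribute examined. A minimal
sketch, assuming the regression data `X`, `y` defined just above:

.. code-block:: default

    # Fit on the whole dataset and inspect the selected regularization strength.
    ridge_cv = lm.RidgeCV(cv=3).fit(X, y)
    lasso_cv = lm.LassoCV(cv=3).fit(X, y)
    print("RidgeCV selected alpha: %.3f" % ridge_cv.alpha_)
    print("LassoCV selected alpha: %.3f" % lasso_cv.alpha_)
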
.. GENERATED FROM PYTHON SOURCE LINES 379-393

Random Permutations: sample the null distribution
-------------------------------------------------

A permutation test is a type of non-parametric randomization test in which the
null distribution of a test statistic is estimated by randomly permuting the
observations.

Permutation tests are highly attractive because they make no assumptions other
than that the observations are independent and identically distributed under the
null hypothesis.

1. Compute an observed statistic :math:`t_{obs}` on the data.

2. Use randomization to compute the distribution of :math:`t` under the null
   hypothesis: perform :math:`N` random permutations of the data. For each
   permuted sample :math:`i`, compute the statistic :math:`t_i`. This procedure
   provides the distribution of *t* under the null hypothesis :math:`H_0`:
   :math:`P(t \vert H_0)`.

3. Compute the p-value as
   :math:`P(t \geq t_{obs} \vert H_0) \approx \frac{\left\vert\{t_i \geq t_{obs}\}\right\vert}{N + 1}`,
   where the :math:`t_i`'s include :math:`t_{obs}`.

Example with ridge regression: sample the distributions of the r-squared and of
the coefficients of a ridge regression under the null hypothesis.

Simulated dataset:

.. GENERATED FROM PYTHON SOURCE LINES 393-406

.. code-block:: default

    # Regression dataset where the first 2 features are predictive
    np.random.seed(0)
    n_features = 5
    n_features_info = 2
    n_samples = 100
    X = np.random.randn(n_samples, n_features)
    beta = np.zeros(n_features)
    beta[:n_features_info] = 1
    Xbeta = np.dot(X, beta)
    eps = np.random.randn(n_samples)
    y = Xbeta + eps

.. GENERATED FROM PYTHON SOURCE LINES 407-409

Random permutations
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 409-439

.. code-block:: default

    # Fit model on all data (!! risk of overfit)
    model = lm.RidgeCV()
    model.fit(X, y)
    print("Coefficients on all data:")
    print(model.coef_)

    # Random permutation loop
    nperm = 1000  # !! Should be at least 1000 (to assess a p-value at 1%)
    scores_names = ["r2"]
    scores_perm = np.zeros((nperm + 1, len(scores_names)))
    coefs_perm = np.zeros((nperm + 1, X.shape[1]))

    scores_perm[0, :] = metrics.r2_score(y, model.predict(X))
    coefs_perm[0, :] = model.coef_

    orig_all = np.arange(X.shape[0])
    for perm_i in range(1, nperm + 1):
        model.fit(X, np.random.permutation(y))
        y_pred = model.predict(X).ravel()
        scores_perm[perm_i, :] = metrics.r2_score(y, y_pred)
        coefs_perm[perm_i, :] = model.coef_

    # One-tailed empirical p-value
    pval_pred_perm = np.sum(scores_perm >= scores_perm[0]) / scores_perm.shape[0]

    pval_coef_perm = np.sum(coefs_perm >= coefs_perm[0, :], axis=0) / coefs_perm.shape[0]

    print("R2 p-value: %.3f" % pval_pred_perm)
    print("Coefficients p-values:", np.round(pval_coef_perm, 3))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Coefficients on all data:
    [ 1.02  1.06  0.21 -0.02 -0.05]
    R2 p-value: 0.001
    Coefficients p-values: [0.   0.   0.1  0.57 0.63]

.. GENERATED FROM PYTHON SOURCE LINES 440-442

Compute p-values corrected for multiple comparisons using the FWER max-T
(Westfall and Young, 1993) procedure.

.. GENERATED FROM PYTHON SOURCE LINES 442-448

.. code-block:: default

    pval_coef_perm_tmax = np.array([np.sum(coefs_perm.max(axis=1) >= coefs_perm[0, j])
                                    for j in range(coefs_perm.shape[1])]) / coefs_perm.shape[0]
    print("P-values with FWER (Westfall and Young) correction")
    print(pval_coef_perm_tmax)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    P-values with FWER (Westfall and Young) correction
    [0.   0.   0.41 0.98 0.99]

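In other words, each observed coefficient is compared with the permutation
distribution of the *maximum* statistic across coefficients. The corrected
p-value computed above for coefficient :math:`j` is

.. math::

   p^{maxT}_j = \frac{\left\vert\left\{ \max_{j'} t_i^{j'} \geq t^{j}_{obs} \right\}\right\vert}{N + 1},

where :math:`t_i^{j'}` is the statistic for coefficient :math:`j'` in
permutation :math:`i`, and the permutations include the observed data.
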
""" # Re-weight to obtain distribution pval = np.sum(perms >= perms[0]) / perms.shape[0] weights = np.ones(perms.shape[0]) / perms.shape[0] ax.hist([perms[perms >= perms[0]], perms], histtype='stepfilled', bins=100, label="p-val<%.3f" % pval, weights=[weights[perms >= perms[0]], weights]) ax.axvline(x=perms[0], color="k", linewidth=2)#, label="observed statistic") ax.set_ylabel(name) ax.legend() return ax n_coef = coefs_perm.shape[1] fig, axes = plt.subplots(n_coef, 1, figsize=(12, 9)) for i in range(n_coef): hist_pvalue( coefs_perm[:, i], axes[i], str(i)) _ = axes[-1].set_xlabel("Coefficient distribution under null hypothesis") .. image:: /auto_gallery/images/sphx_glr_ml_resampling_001.png :alt: ml resampling :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 481-488 Exercise Given the logistic regression presented above and its validation given a 5 folds CV. 1. Compute the p-value associated with the prediction accuracy measured with 5CV using a permutation test. 2. Compute the p-value associated with the prediction accuracy using a parametric test. .. GENERATED FROM PYTHON SOURCE LINES 490-507 Bootstrapping ------------- Bootstrapping is a statistical technique which consists in generating sample (called bootstrap samples) from an initial dataset of size *N* by randomly drawing with replacement *N* observations. It provides sub-samples with the same distribution than the original dataset. It aims to: 1. Assess the variability (standard error, [confidence intervals.](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals)) of performances scores or estimated parameters (see Efron et al. 1986). 2. Regularize model by fitting several models on bootstrap samples and averaging their predictions (see Baging and random-forest). A great advantage of bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratio, and correlation coefficients. 1. Perform :math:`B` sampling, with replacement, of the dataset. 2. For each sample :math:`i` fit the model and compute the scores. 3. Assess standard errors and confidence intervals of scores using the scores obtained on the :math:`B` resampled dataset. Or, average models predictions. References: [Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1986;1:54–75](https://projecteuclid.org/download/pdf_1/euclid.ss/1177013815) .. GENERATED FROM PYTHON SOURCE LINES 507-526 .. code-block:: default # Bootstrap loop nboot = 100 # !! Should be at least 1000 scores_names = ["r2"] scores_boot = np.zeros((nboot, len(scores_names))) coefs_boot = np.zeros((nboot, X.shape[1])) orig_all = np.arange(X.shape[0]) for boot_i in range(nboot): boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True) boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False) Xtr, ytr = X[boot_tr, :], y[boot_tr] Xte, yte = X[boot_te, :], y[boot_te] model.fit(Xtr, ytr) y_pred = model.predict(Xte).ravel() scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred) coefs_boot[boot_i, :] = model.coef_ .. GENERATED FROM PYTHON SOURCE LINES 527-529 Compute Mean, SE, CI Coeffitients 0 and 1 are significantly different from 0. .. GENERATED FROM PYTHON SOURCE LINES 529-540 .. 
.. GENERATED FROM PYTHON SOURCE LINES 490-507

Bootstrapping
-------------

Bootstrapping is a statistical technique which consists in generating samples
(called bootstrap samples) from an initial dataset of size *N* by randomly
drawing, with replacement, *N* observations. It provides sub-samples with the
same distribution as the original dataset. It aims to:

1. Assess the variability (standard error,
   [confidence intervals](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals))
   of performance scores or estimated parameters (see Efron et al. 1986).

2. Regularize models by fitting several models on bootstrap samples and averaging
   their predictions (see bagging and random forests).

A great advantage of bootstrap is its simplicity. It is a straightforward way to
derive estimates of standard errors and confidence intervals for complex
estimators of complex parameters of the distribution, such as percentile points,
proportions, odds ratios, and correlation coefficients.

1. Perform :math:`B` samplings, with replacement, of the dataset.

2. For each sample :math:`i`, fit the model and compute the scores.

3. Assess standard errors and confidence intervals of the scores using the scores
   obtained on the :math:`B` resampled datasets. Or, average the models'
   predictions.

References:
[Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence
intervals, and other measures of statistical accuracy. Stat Sci 1986;1:54-75](https://projecteuclid.org/download/pdf_1/euclid.ss/1177013815)

.. GENERATED FROM PYTHON SOURCE LINES 507-526

.. code-block:: default

    # Bootstrap loop
    nboot = 100  # !! Should be at least 1000
    scores_names = ["r2"]
    scores_boot = np.zeros((nboot, len(scores_names)))
    coefs_boot = np.zeros((nboot, X.shape[1]))

    orig_all = np.arange(X.shape[0])
    for boot_i in range(nboot):
        boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True)
        boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False)
        Xtr, ytr = X[boot_tr, :], y[boot_tr]
        Xte, yte = X[boot_te, :], y[boot_te]
        model.fit(Xtr, ytr)
        y_pred = model.predict(Xte).ravel()
        scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred)
        coefs_boot[boot_i, :] = model.coef_

.. GENERATED FROM PYTHON SOURCE LINES 527-529

Compute the mean, SE, and CI.
Coefficients 0 and 1 are significantly different from 0.

.. GENERATED FROM PYTHON SOURCE LINES 529-540

.. code-block:: default

    scores_boot = pd.DataFrame(scores_boot, columns=scores_names)
    scores_stat = scores_boot.describe(percentiles=[.975, .5, .025])

    print("r-squared: Mean=%.2f, SE=%.2f, CI=(%.2f %.2f)" %
          tuple(scores_stat.loc[["mean", "std", "2.5%", "97.5%"], "r2"]))

    coefs_boot = pd.DataFrame(coefs_boot)
    coefs_stat = coefs_boot.describe(percentiles=[.975, .5, .025])
    print("Coefficients distribution")
    print(coefs_stat)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    r-squared: Mean=0.59, SE=0.09, CI=(0.40 0.73)
    Coefficients distribution
                0       1         2       3       4
    count  100.00  100.00  1.00e+02  100.00  100.00
    mean     1.02    1.05  2.12e-01   -0.02   -0.05
    std      0.09    0.11  9.75e-02    0.10    0.11
    min      0.63    0.82 -2.69e-03   -0.23   -0.27
    2.5%     0.86    0.88  3.27e-02   -0.20   -0.23
    50%      1.03    1.04  2.17e-01   -0.01   -0.06
    97.5%    1.17    1.29  3.93e-01    0.15    0.14
    max      1.20    1.45  4.33e-01    0.22    0.29

.. GENERATED FROM PYTHON SOURCE LINES 541-542

Plot the coefficient distributions

.. GENERATED FROM PYTHON SOURCE LINES 542-549

.. code-block:: default

    df = pd.DataFrame(coefs_boot)
    staked = pd.melt(df, var_name="Variable", value_name="Coef. distribution")
    sns.set_theme(style="whitegrid")
    ax = sns.violinplot(x="Variable", y="Coef. distribution", data=staked)
    _ = ax.axhline(0, ls='--', lw=2, color="black")

.. image:: /auto_gallery/images/sphx_glr_ml_resampling_002.png
   :alt: ml resampling
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 550-554

Parallel computation with joblib
--------------------------------

Dataset

.. GENERATED FROM PYTHON SOURCE LINES 554-564

.. code-block:: default

    import numpy as np
    from sklearn import datasets
    import sklearn.linear_model as lm
    import sklearn.metrics as metrics
    from sklearn.model_selection import StratifiedKFold

    X, y = datasets.make_classification(n_samples=20, n_features=5,
                                        n_informative=2, random_state=42)
    cv = StratifiedKFold(n_splits=5)

.. GENERATED FROM PYTHON SOURCE LINES 565-566

Use the `cross_validate` function

.. GENERATED FROM PYTHON SOURCE LINES 566-574

.. code-block:: default

    from sklearn.model_selection import cross_validate

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')
    cv_results = cross_validate(estimator, X, y, cv=cv, n_jobs=5)
    print(np.mean(cv_results['test_score']), cv_results['test_score'])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5 0.5 1.  1.  1. ]

.. GENERATED FROM PYTHON SOURCE LINES 575-578

Sequential computation

If we want full control of the operations performed within each fold (to
retrieve the models' parameters, etc.), we would like to parallelize the
following sequential code:

.. GENERATED FROM PYTHON SOURCE LINES 578-601

.. code-block:: default

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')
    y_test_pred_seq = np.zeros(len(y))  # Store predictions in the original order
    coefs_seq = list()
    for train, test in cv.split(X, y):
        X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
        estimator.fit(X_train, y_train)
        y_test_pred_seq[test] = estimator.predict(X_test)
        coefs_seq.append(estimator.coef_)

    test_accs = [metrics.accuracy_score(y[test], y_test_pred_seq[test])
                 for train, test in cv.split(X, y)]

    print(np.mean(test_accs), test_accs)
    coefs_cv = np.array(coefs_seq)
    print(coefs_cv)

    print(coefs_cv.mean(axis=0))

    print("Std Err of the coef")
    print(coefs_cv.std(axis=0) / np.sqrt(coefs_cv.shape[0]))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5, 0.5, 1.0, 1.0, 1.0]
    [[[-0.88  0.63  1.19 -0.31 -0.38]]

     [[-0.75  0.62  1.1   0.2  -0.4 ]]

     [[-0.96  0.51  1.12  0.08 -0.26]]

     [[-0.86  0.52  1.07 -0.11 -0.29]]

     [[-0.9   0.51  1.09 -0.25 -0.28]]]
    [[-0.87  0.56  1.11 -0.08 -0.32]]
    Std Err of the coef
    [[0.03 0.02 0.02 0.09 0.03]]

.. GENERATED FROM PYTHON SOURCE LINES 602-604

Parallel computation with joblib
--------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 604-633

.. code-block:: default

    from joblib import Parallel, delayed
    from sklearn.base import is_classifier, clone

    def _split_fit_predict(estimator, X, y, train, test):
        X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
        estimator.fit(X_train, y_train)
        return [estimator.predict(X_test), estimator.coef_]

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')

    parallel = Parallel(n_jobs=5)
    cv_ret = parallel(
        delayed(_split_fit_predict)(
            clone(estimator), X, y, train, test) for train, test in cv.split(X, y))

    y_test_pred_cv, coefs_cv = zip(*cv_ret)

    # Retrieve predictions in the original order
    y_test_pred = np.zeros(len(y))
    for i, (train, test) in enumerate(cv.split(X, y)):
        y_test_pred[test] = y_test_pred_cv[i]

    test_accs = [metrics.accuracy_score(y[test], y_test_pred[test])
                 for train, test in cv.split(X, y)]
    print(np.mean(test_accs), test_accs)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5, 0.5, 1.0, 1.0, 1.0]

.. GENERATED FROM PYTHON SOURCE LINES 634-635

Test that we obtain the same predictions and the same coefficients

.. GENERATED FROM PYTHON SOURCE LINES 635-639

.. code-block:: default

    assert np.all(y_test_pred == y_test_pred_seq)
    assert np.allclose(np.array(coefs_cv).squeeze(), np.array(coefs_seq).squeeze())

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 4.727 seconds)