.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/ml_resampling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_ml_resampling.py:

Resampling methods
==================

.. GENERATED FROM PYTHON SOURCE LINES 5-20

.. code-block:: default

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    from sklearn import datasets
    import sklearn.linear_model as lm
    from sklearn.model_selection import train_test_split, KFold, PredefinedSplit
    from sklearn.model_selection import cross_val_score, GridSearchCV
    import sklearn.metrics as metrics

    X, y = datasets.make_regression(n_samples=100, n_features=100,
                                    n_informative=10, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 21-63

Train, validation and test sets
-------------------------------

Machine learning algorithms overfit training data. Predictive performance
**MUST** be evaluated on an independent hold-out dataset.

.. figure:: ../images/train_val_test_cv.png
   :alt: Train, validation and test sets.

1. **Training dataset**: Dataset used to fit the model (set the model parameters,
   like weights). The *training error* can be easily calculated by applying the
   statistical learning method to the observations used in its training. But
   because of overfitting, the **training error rate can dramatically
   underestimate the error** that would be obtained on new samples.

2. **Validation dataset**: Dataset used to provide an unbiased evaluation of a
   model fitted on the training dataset while **tuning model hyperparameters**,
   i.e. **model selection**. The validation error is the average error that
   results from using the learning method to predict the response on new
   (validation) samples, that is, samples that were not used to train the method.

3. **Test dataset**: Dataset used to provide an unbiased **evaluation of the
   final model** fitted on the training dataset. It is only used once the model is
   completely trained (using the train and validation sets).

What is the Difference Between Test and Validation Datasets? by
[Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)

Thus the original dataset is generally split into training, validation and test
datasets. A large training + validation set (80%) with a small test set (20%) may
provide a poor estimate of the predictive performance (the same argument holds
for the train vs. validation split). Conversely, a large test set with a small
training set may produce a poorly estimated learner. This is why, in situations
where we cannot afford such a split, cross-validation can be used for model
selection and/or model evaluation.

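As an illustration, here is a minimal sketch of such a three-way split, using two
successive `train_test_split` calls and an arbitrary 60%/20%/20% partition of the
`X`, `y` regression data generated above (the proportions are a choice made for
this sketch, not a recommendation):

.. code-block:: default

    # Illustrative 60/20/20 split: first carve out the test set, then split the
    # remainder into train and validation sets.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                      random_state=42)
    print("Train: %i, Validation: %i, Test: %i" %
          (X_train.shape[0], X_val.shape[0], X_test.shape[0]))
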
If the sample size is limited, a train/validation/test split may not be possible.
**Cross-Validation (CV)** (see below) can be used to replace:

- the outer (train/test) split of model evaluation;
- the inner train/validation split of model selection (the more frequent situation);
- both the inner and outer splits, leading to two nested CVs.

.. GENERATED FROM PYTHON SOURCE LINES 66-69

Split dataset in train/test sets for model evaluation
-----------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 69-81

.. code-block:: default

    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

    mod = lm.Ridge(alpha=10)

    mod.fit(X_train, y_train)

    y_pred_test = mod.predict(X_test)
    print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test R2: 0.74

.. GENERATED FROM PYTHON SOURCE LINES 82-101

Train/validation/test splits: model selection and model evaluation
-------------------------------------------------------------------

The **grid search procedure** (`GridSearchCV`) performs model selection of the
best **hyper-parameters** :math:`\alpha` over a grid of possible values. The
train set is split (inner split) into train/validation sets.

**Model selection with the grid search procedure:**

1. Fit the learner (i.e. estimate the **parameters** :math:`\mathbf{\Omega}_k`) on
   the training set:
   :math:`\mathbf{X}_{train}, \mathbf{y}_{train} \rightarrow f_{\alpha_k, \mathbf{\Omega}_k}(.)`

2. Evaluate the model on the validation set and keep the hyper-parameter(s) that
   minimise the error measure:
   :math:`\alpha_* = \arg \min L(f_{\alpha_k, \mathbf{\Omega}_k}(\mathbf{X}_{val}), \mathbf{y}_{val})`

3. Refit the learner on all training + validation data,
   :math:`\mathbf{X}_{train \cup val}, \mathbf{y}_{train \cup val}`, using the best
   hyper-parameters (:math:`\alpha_*`):
   :math:`\rightarrow f_{\alpha_*, \mathbf{\Omega}_*}(.)`

**Model evaluation:** on the test set:
:math:`L(f_{\alpha_*, \mathbf{\Omega}_*}(\mathbf{X}_{test}), \mathbf{y}_{test})`

.. GENERATED FROM PYTHON SOURCE LINES 101-122

.. code-block:: default

    train_idx, validation_idx = train_test_split(np.arange(X_train.shape[0]),
                                                 test_size=0.25, shuffle=True,
                                                 random_state=42)

    # Build the test_fold array expected by PredefinedSplit:
    # -1 for training samples, 0 for validation samples.
    validation_fold = np.full(X_train.shape[0], -1)
    validation_fold[validation_idx] = 0
    split_inner = PredefinedSplit(test_fold=validation_fold)

    print("Train set size: %i" % X_train[train_idx].shape[0])
    print("Validation set size: %i" % X_train[validation_idx].shape[0])
    print("Test set size: %i" % X_test.shape[0])

    lm_cv = GridSearchCV(lm.Ridge(), {'alpha': 10. ** np.arange(-3, 3)},
                         cv=split_inner, n_jobs=5)

    # Fit, including model selection with internal train/validation split
    lm_cv.fit(X_train, y_train)

    # Predict
    y_pred_test = lm_cv.predict(X_test)
    print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Train set size: 56
    Validation set size: 19
    Test set size: 25
    Test R2: 0.80

.. GENERATED FROM PYTHON SOURCE LINES 123-179

Cross-Validation (CV)
---------------------

If the sample size is limited, a train/validation/test split may not be possible.
**Cross-Validation (CV)** can be used to replace the train/validation split
and/or the train+validation / test split.

Cross-validation randomly divides the set of observations into *K* groups, or
**folds**, of approximately equal size. Each fold in turn is treated as a
validation set, and the method :math:`f()` is fitted on the remaining union of
*K - 1* folds: (:math:`f(\boldsymbol{X}_{-K}, \boldsymbol{y}_{-K})`).

The measure of performance (the score function :math:`\mathcal{S}`), either an
error measure or a correct-prediction measure, is an average of a loss or
correct-prediction measure, noted :math:`\mathcal{L}`, between the true and the
predicted target values. The score function is evaluated on the observations of
the held-out fold. For each sample *i* we consider the model
:math:`f(\boldsymbol{X}_{-k(i)}, \boldsymbol{y}_{-k(i)})` estimated on the data
without the group *k* that contains *i*, noted *-k(i)*.

This procedure is repeated *K* times; each time, a different group of
observations is treated as the held-out set. We then compare the predicted value
(:math:`f_{-k(i)}(\boldsymbol{x}_i) = \hat{y_i}`) with the true value
:math:`y_i` using an error or loss function :math:`\mathcal{L}(y, \hat{y})`.

For a 10-fold CV we can either average over the 10 fold-level values (macro
measure) or concatenate the 10 experiments and compute the micro measure. Two
strategies
[micro vs macro estimates](https://stats.stackexchange.com/questions/34611/meanscores-vs-scoreconcatenation-in-cross-validation):

- **Micro measure: average(individual scores)**: compute a score
  :math:`\mathcal{S}` for each sample and average over all samples. It is
  similar to **average score(concatenation)**: an averaged score computed over
  all concatenated samples.

  .. raw:: latex

     \mathcal{S}(f) = \frac{1}{N} \sum_i^N \mathcal{L}\left(y_i, f(\boldsymbol{x}_{-k(i)}, \boldsymbol{y}_{-k(i)}) \right).

- **Macro measure: mean(CV scores)** (the most commonly used method): compute a
  score :math:`\mathcal{S}` on each fold *k* and average across folds:

  .. raw:: latex

     \begin{align*}
        \mathcal{S}(f) &= \frac{1}{K} \sum_k^K \mathcal{S}_k(f).\\
        \mathcal{S}(f) &= \frac{1}{K} \sum_k^K \frac{1}{N_k} \sum_{i \in k} \mathcal{L}\left(y_i, f(\boldsymbol{x}_{-k(i)}, \boldsymbol{y}_{-k(i)}) \right).
     \end{align*}

These two measures (an average of averages vs. a global average) are generally
similar (a small numerical illustration is given below). They may differ
slightly when folds are of different sizes.

This validation scheme is known as **K-Fold CV**. Typical choices of *K* are
5 or 10 [Kohavi 1995]. The extreme case where *K = N* is known as
**leave-one-out cross-validation, LOO-CV**.

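To make the distinction concrete, here is a minimal sketch (assuming the
regression data `X`, `y` generated at the beginning of this example and an
arbitrary 5-fold split) that computes the macro measure, the mean of the
per-fold R2, and the micro measure, the R2 of the concatenated out-of-fold
predictions:

.. code-block:: default

    # Macro vs. micro CV measures with a 5-fold CV (illustrative sketch).
    estimator = lm.Ridge(alpha=10)
    cv = KFold(n_splits=5, shuffle=True, random_state=42)

    fold_scores = []               # per-fold scores (macro)
    y_pred_oof = np.zeros(len(y))  # out-of-fold predictions (micro)

    for train, test in cv.split(X):
        estimator.fit(X[train, :], y[train])
        y_pred_oof[test] = estimator.predict(X[test, :])
        fold_scores.append(metrics.r2_score(y[test], y_pred_oof[test]))

    print("Macro R2 (mean of fold scores): %.2f" % np.mean(fold_scores))
    print("Micro R2 (concatenated folds):  %.2f" % metrics.r2_score(y, y_pred_oof))
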
.. GENERATED FROM PYTHON SOURCE LINES 181-188

CV for regression
~~~~~~~~~~~~~~~~~

Usually the score function is the r-squared. However, other metrics
(MAE, MSE) can be used.

**CV with explicit loop:**

.. GENERATED FROM PYTHON SOURCE LINES 188-204

.. code-block:: default

    from sklearn.model_selection import KFold

    estimator = lm.Ridge(alpha=10)

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    r2_train, r2_test = list(), list()

    for train, test in cv.split(X):
        estimator.fit(X[train, :], y[train])
        r2_train.append(metrics.r2_score(y[train], estimator.predict(X[train, :])))
        r2_test.append(metrics.r2_score(y[test], estimator.predict(X[test, :])))

    print("Train r2:%.2f" % np.mean(r2_train))
    print("Test r2:%.2f" % np.mean(r2_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Train r2:0.99
    Test r2:0.67

.. GENERATED FROM PYTHON SOURCE LINES 205-208

Scikit-learn provides user-friendly functions to perform CV.

`cross_val_score()`: single metric

.. GENERATED FROM PYTHON SOURCE LINES 208-219

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(estimator=estimator, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(estimator=estimator, X=X, y=y, cv=cv)
    print("Test r2:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test r2:0.73
    Test r2:0.67

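`cross_val_score` uses the estimator's default score (the r-squared for
regressors). A different metric can be requested through the `scoring` argument;
a minimal sketch, reusing the `estimator` and `cv` defined above:

.. code-block:: default

    # Request the mean absolute error instead of the default r-squared.
    # sklearn uses "neg_" scorers so that greater is always better.
    scores_mae = cross_val_score(estimator=estimator, X=X, y=y, cv=cv,
                                 scoring='neg_mean_absolute_error')
    print("Test MAE:%.2f" % -scores_mae.mean())
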
.. GENERATED FROM PYTHON SOURCE LINES 220-221

`cross_validate()`: multi metric, + time, etc.

.. GENERATED FROM PYTHON SOURCE LINES 221-231

.. code-block:: default

    from sklearn.model_selection import cross_validate

    scores = cross_validate(estimator=mod, X=X, y=y, cv=cv,
                            scoring=['r2', 'neg_mean_absolute_error'])

    print("Test R2:%.2f; MAE:%.2f" % (scores['test_r2'].mean(),
                                      -scores['test_neg_mean_absolute_error'].mean()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test R2:0.67; MAE:55.27

.. GENERATED FROM PYTHON SOURCE LINES 232-244

CV for classification: stratify for the target label
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With classification problems it is essential to sample folds where each fold
contains approximately the same percentage of samples of each target class as
the complete set. This is called **stratification**. In this case, we will use
``StratifiedKFold``, which is a variation of k-fold that returns stratified
folds.

Usually the score functions are, at least, the sensitivity and the specificity.
However, other functions can be used.

**CV with explicit loop**:

.. GENERATED FROM PYTHON SOURCE LINES 244-265

.. code-block:: default

    from sklearn.model_selection import StratifiedKFold

    X, y = datasets.make_classification(n_samples=100, n_features=100, shuffle=True,
                                        n_informative=10, random_state=42)

    mod = lm.LogisticRegression(C=1, solver='lbfgs')

    cv = StratifiedKFold(n_splits=5)

    # Lists to store scores by folds (for macro measure only)
    auc, bacc = [], []

    for train, test in cv.split(X, y):
        mod.fit(X[train, :], y[train])
        auc.append(metrics.roc_auc_score(y[test], mod.decision_function(X[test, :])))
        bacc.append(metrics.balanced_accuracy_score(y[test], mod.predict(X[test, :])))

    print("Test AUC:%.2f; bACC:%.2f" % (np.mean(auc), np.mean(bacc)))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.86; bACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 266-267

`cross_val_score()`: single metric

.. GENERATED FROM PYTHON SOURCE LINES 267-273

.. code-block:: default

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=5)
    print("Test ACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test ACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 274-275

Provide your own CV and score function:

.. GENERATED FROM PYTHON SOURCE LINES 275-284

.. code-block:: default

    def balanced_acc(estimator, X, y, **kwargs):
        """Balanced accuracy scorer."""
        return metrics.recall_score(y, estimator.predict(X), average=None).mean()

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=cv,
                             scoring=balanced_acc)
    print("Test bACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test bACC:0.80

.. GENERATED FROM PYTHON SOURCE LINES 285-286

`cross_validate()`: multi metric, + time, etc.

.. GENERATED FROM PYTHON SOURCE LINES 286-297

.. code-block:: default

    from sklearn.model_selection import cross_validate

    scores = cross_validate(estimator=mod, X=X, y=y, cv=cv,
                            scoring=['balanced_accuracy', 'roc_auc'])

    print("Test AUC:%.2f; bACC:%.2f" % (scores['test_roc_auc'].mean(),
                                        scores['test_balanced_accuracy'].mean()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.86; bACC:0.80

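Instead of writing the scorer by hand as above, a plain metric function can also
be wrapped with `sklearn.metrics.make_scorer`. A minimal sketch, reusing the
`mod` and `cv` defined above:

.. code-block:: default

    from sklearn.metrics import make_scorer

    # Wrap balanced_accuracy_score into a scorer usable by cross_val_score
    # and GridSearchCV.
    bacc_scorer = make_scorer(metrics.balanced_accuracy_score)

    scores = cross_val_score(estimator=mod, X=X, y=y, cv=cv, scoring=bacc_scorer)
    print("Test bACC:%.2f" % scores.mean())
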
.. GENERATED FROM PYTHON SOURCE LINES 298-304

Cross-validation for model selection
------------------------------------

Combine CV and grid search: re-split (inner split) the train set into CV
train/validation folds and build a `GridSearchCV` out of it:

.. GENERATED FROM PYTHON SOURCE LINES 304-323

.. code-block:: default

    # Outer split:
    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)

    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Cross-validation for model selection
    lm_cv = GridSearchCV(lm.LogisticRegression(), {'C': 10. ** np.arange(-3, 3)},
                         cv=cv_inner, n_jobs=5)

    # Fit, including model selection with internal CV
    lm_cv.fit(X_train, y_train)

    # Predict
    y_pred_test = lm_cv.predict(X_test)
    print("Test bACC: %.2f" % metrics.balanced_accuracy_score(y_test, y_pred_test))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test bACC: 0.63

.. GENERATED FROM PYTHON SOURCE LINES 324-326

Cross-validation for both model (outer) evaluation and model (inner) selection
-------------------------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 326-342

.. code-block:: default

    cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Cross-validation for model (inner) selection
    lm_cv = GridSearchCV(lm.LogisticRegression(), {'C': 10. ** np.arange(-3, 3)},
                         cv=cv_inner, n_jobs=5)

    # Cross-validation for model (outer) evaluation
    scores = cross_validate(estimator=lm_cv, X=X, y=y, cv=cv_outer,
                            scoring=['balanced_accuracy', 'roc_auc'])

    print("Test AUC:%.2f; bACC:%.2f, Time: %.2fs" % (scores['test_roc_auc'].mean(),
                                                     scores['test_balanced_accuracy'].mean(),
                                                     scores['fit_time'].sum()))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Test AUC:0.85; bACC:0.74, Time: 0.03s

.. GENERATED FROM PYTHON SOURCE LINES 343-349

Models with built-in cross-validation
--------------------------------------

Let sklearn select the best hyper-parameters over a default grid.

**Classification**

.. GENERATED FROM PYTHON SOURCE LINES 349-356

.. code-block:: default

    print("== Logistic Ridge (L2 penalty) ==")
    mod_cv = lm.LogisticRegressionCV(class_weight='balanced',
                                     scoring='balanced_accuracy', n_jobs=-1, cv=5)
    scores = cross_val_score(estimator=mod_cv, X=X, y=y, cv=5)
    print("Test ACC:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    == Logistic Ridge (L2 penalty) ==
    Test ACC:0.78

.. GENERATED FROM PYTHON SOURCE LINES 357-358

**Regression**

.. GENERATED FROM PYTHON SOURCE LINES 358-378

.. code-block:: default

    X, y, coef = datasets.make_regression(n_samples=50, n_features=100, noise=10,
                                          n_informative=2, random_state=42, coef=True)

    print("== Ridge (L2 penalty) ==")
    model = lm.RidgeCV(cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    print("== Lasso (L1 penalty) ==")
    model = lm.LassoCV(n_jobs=-1, cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

    print("== ElasticNet (L1+L2 penalty) ==")
    model = lm.ElasticNetCV(l1_ratio=[.1, .5, .9], n_jobs=-1, cv=3)
    scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
    print("Test r2:%.2f" % scores.mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    == Ridge (L2 penalty) ==
    Test r2:0.16
    == Lasso (L1 penalty) ==
    Test r2:0.74
    == ElasticNet (L1+L2 penalty) ==
    Test r2:0.58

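To inspect which regularization strength such built-in CV models actually
selected, they can be fitted and their `alpha_` attribute examined. A minimal
sketch, assuming the regression data `X`, `y` defined just above:

.. code-block:: default

    # Fit on the whole dataset and inspect the selected regularization strength.
    ridge_cv = lm.RidgeCV(cv=3).fit(X, y)
    lasso_cv = lm.LassoCV(cv=3).fit(X, y)
    print("RidgeCV selected alpha: %.3f" % ridge_cv.alpha_)
    print("LassoCV selected alpha: %.3f" % lasso_cv.alpha_)
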
.. GENERATED FROM PYTHON SOURCE LINES 379-393

Random Permutations: sample the null distribution
-------------------------------------------------

A permutation test is a type of non-parametric randomization test in which the
null distribution of a test statistic is estimated by randomly permuting the
observations.

Permutation tests are highly attractive because they make no assumptions other
than that the observations are independent and identically distributed under the
null hypothesis.

1. Compute an observed statistic :math:`t_{obs}` on the data.

2. Use randomization to compute the distribution of :math:`t` under the null
   hypothesis: perform :math:`N` random permutations of the data. For each
   permuted sample :math:`i`, compute the statistic :math:`t_i`. This procedure
   provides the distribution of *t* under the null hypothesis :math:`H_0`:
   :math:`P(t \vert H_0)`.

3. Compute the p-value as
   :math:`P(t \geq t_{obs} \vert H_0) \approx \frac{\left\vert\{t_i \geq t_{obs}\}\right\vert}{N + 1}`,
   where the :math:`t_i`'s include :math:`t_{obs}`.

Example with ridge regression: sample the distributions of the r-squared and of
the coefficients of a ridge regression under the null hypothesis.

Simulated dataset:

.. GENERATED FROM PYTHON SOURCE LINES 393-406

.. code-block:: default

    # Regression dataset where the first 2 features are predictive
    np.random.seed(0)
    n_features = 5
    n_features_info = 2
    n_samples = 100
    X = np.random.randn(n_samples, n_features)
    beta = np.zeros(n_features)
    beta[:n_features_info] = 1
    Xbeta = np.dot(X, beta)
    eps = np.random.randn(n_samples)
    y = Xbeta + eps

.. GENERATED FROM PYTHON SOURCE LINES 407-409

Random permutations
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 409-439

.. code-block:: default

    # Fit model on all data (!! risk of overfit)
    model = lm.RidgeCV()
    model.fit(X, y)
    print("Coefficients on all data:")
    print(model.coef_)

    # Random permutation loop
    nperm = 1000  # !! Should be at least 1000 (to assess a p-value at 1%)
    scores_names = ["r2"]
    scores_perm = np.zeros((nperm + 1, len(scores_names)))
    coefs_perm = np.zeros((nperm + 1, X.shape[1]))

    scores_perm[0, :] = metrics.r2_score(y, model.predict(X))
    coefs_perm[0, :] = model.coef_

    orig_all = np.arange(X.shape[0])
    for perm_i in range(1, nperm + 1):
        model.fit(X, np.random.permutation(y))
        y_pred = model.predict(X).ravel()
        scores_perm[perm_i, :] = metrics.r2_score(y, y_pred)
        coefs_perm[perm_i, :] = model.coef_

    # One-tailed empirical p-value
    pval_pred_perm = np.sum(scores_perm >= scores_perm[0]) / scores_perm.shape[0]

    pval_coef_perm = np.sum(coefs_perm >= coefs_perm[0, :], axis=0) / coefs_perm.shape[0]

    print("R2 p-value: %.3f" % pval_pred_perm)
    print("Coefficients p-values:", np.round(pval_coef_perm, 3))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Coefficients on all data:
    [ 1.02  1.06  0.21 -0.02 -0.05]
    R2 p-value: 0.001
    Coefficients p-values: [0.   0.   0.1  0.57 0.63]

.. GENERATED FROM PYTHON SOURCE LINES 440-442

Compute p-values corrected for multiple comparisons using the FWER max-T
(Westfall and Young, 1993) procedure.

.. GENERATED FROM PYTHON SOURCE LINES 442-448

.. code-block:: default

    pval_coef_perm_tmax = np.array([np.sum(coefs_perm.max(axis=1) >= coefs_perm[0, j])
                                    for j in range(coefs_perm.shape[1])]) / coefs_perm.shape[0]
    print("P-values with FWER (Westfall and Young) correction")
    print(pval_coef_perm_tmax)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    P-values with FWER (Westfall and Young) correction
    [0.   0.   0.41 0.98 0.99]

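In other words, each observed coefficient is compared with the permutation
distribution of the *maximum* statistic across coefficients. The corrected
p-value computed above for coefficient :math:`j` is

.. math::

   p^{maxT}_j = \frac{\left\vert\left\{ \max_{j'} t_i^{j'} \geq t^{j}_{obs} \right\}\right\vert}{N + 1},

where :math:`t_i^{j'}` is the statistic for coefficient :math:`j'` in
permutation :math:`i`, and the permutations include the observed data.
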
""" # Re-weight to obtain distribution pval = np.sum(perms >= perms[0]) / perms.shape[0] weights = np.ones(perms.shape[0]) / perms.shape[0] ax.hist([perms[perms >= perms[0]], perms], histtype='stepfilled', bins=100, label="p-val<%.3f" % pval, weights=[weights[perms >= perms[0]], weights]) ax.axvline(x=perms[0], color="k", linewidth=2)#, label="observed statistic") ax.set_ylabel(name) ax.legend() return ax n_coef = coefs_perm.shape[1] fig, axes = plt.subplots(n_coef, 1, figsize=(12, 9)) for i in range(n_coef): hist_pvalue( coefs_perm[:, i], axes[i], str(i)) _ = axes[-1].set_xlabel("Coefficient distribution under null hypothesis") .. image:: /auto_gallery/images/sphx_glr_ml_resampling_001.png :alt: ml resampling :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 481-488 Exercise Given the logistic regression presented above and its validation given a 5 folds CV. 1. Compute the p-value associated with the prediction accuracy measured with 5CV using a permutation test. 2. Compute the p-value associated with the prediction accuracy using a parametric test. .. GENERATED FROM PYTHON SOURCE LINES 490-507 Bootstrapping ------------- Bootstrapping is a statistical technique which consists in generating sample (called bootstrap samples) from an initial dataset of size *N* by randomly drawing with replacement *N* observations. It provides sub-samples with the same distribution than the original dataset. It aims to: 1. Assess the variability (standard error, [confidence intervals.](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals)) of performances scores or estimated parameters (see Efron et al. 1986). 2. Regularize model by fitting several models on bootstrap samples and averaging their predictions (see Baging and random-forest). A great advantage of bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratio, and correlation coefficients. 1. Perform :math:`B` sampling, with replacement, of the dataset. 2. For each sample :math:`i` fit the model and compute the scores. 3. Assess standard errors and confidence intervals of scores using the scores obtained on the :math:`B` resampled dataset. Or, average models predictions. References: [Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1986;1:54–75](https://projecteuclid.org/download/pdf_1/euclid.ss/1177013815) .. GENERATED FROM PYTHON SOURCE LINES 507-526 .. code-block:: default # Bootstrap loop nboot = 100 # !! Should be at least 1000 scores_names = ["r2"] scores_boot = np.zeros((nboot, len(scores_names))) coefs_boot = np.zeros((nboot, X.shape[1])) orig_all = np.arange(X.shape[0]) for boot_i in range(nboot): boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True) boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False) Xtr, ytr = X[boot_tr, :], y[boot_tr] Xte, yte = X[boot_te, :], y[boot_te] model.fit(Xtr, ytr) y_pred = model.predict(Xte).ravel() scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred) coefs_boot[boot_i, :] = model.coef_ .. GENERATED FROM PYTHON SOURCE LINES 527-529 Compute Mean, SE, CI Coeffitients 0 and 1 are significantly different from 0. .. GENERATED FROM PYTHON SOURCE LINES 529-540 .. 
.. GENERATED FROM PYTHON SOURCE LINES 490-507

Bootstrapping
-------------

Bootstrapping is a statistical technique which consists in generating samples
(called bootstrap samples) from an initial dataset of size *N* by randomly
drawing, with replacement, *N* observations. It provides sub-samples with the
same distribution as the original dataset. It aims to:

1. Assess the variability (standard error,
   [confidence intervals](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals))
   of performance scores or estimated parameters (see Efron et al. 1986).

2. Regularize models by fitting several models on bootstrap samples and averaging
   their predictions (see bagging and random forests).

A great advantage of bootstrap is its simplicity. It is a straightforward way to
derive estimates of standard errors and confidence intervals for complex
estimators of complex parameters of the distribution, such as percentile points,
proportions, odds ratios, and correlation coefficients.

1. Perform :math:`B` samplings, with replacement, of the dataset.

2. For each sample :math:`i`, fit the model and compute the scores.

3. Assess standard errors and confidence intervals of the scores using the scores
   obtained on the :math:`B` resampled datasets. Or, average the models'
   predictions.

References:
[Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence
intervals, and other measures of statistical accuracy. Stat Sci 1986;1:54-75](https://projecteuclid.org/download/pdf_1/euclid.ss/1177013815)

.. GENERATED FROM PYTHON SOURCE LINES 507-526

.. code-block:: default

    # Bootstrap loop
    nboot = 100  # !! Should be at least 1000
    scores_names = ["r2"]
    scores_boot = np.zeros((nboot, len(scores_names)))
    coefs_boot = np.zeros((nboot, X.shape[1]))

    orig_all = np.arange(X.shape[0])
    for boot_i in range(nboot):
        boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True)
        boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False)
        Xtr, ytr = X[boot_tr, :], y[boot_tr]
        Xte, yte = X[boot_te, :], y[boot_te]
        model.fit(Xtr, ytr)
        y_pred = model.predict(Xte).ravel()
        scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred)
        coefs_boot[boot_i, :] = model.coef_

.. GENERATED FROM PYTHON SOURCE LINES 527-529

Compute the mean, SE, and CI.
Coefficients 0 and 1 are significantly different from 0.

.. GENERATED FROM PYTHON SOURCE LINES 529-540

.. code-block:: default

    scores_boot = pd.DataFrame(scores_boot, columns=scores_names)
    scores_stat = scores_boot.describe(percentiles=[.975, .5, .025])

    print("r-squared: Mean=%.2f, SE=%.2f, CI=(%.2f %.2f)" %
          tuple(scores_stat.loc[["mean", "std", "2.5%", "97.5%"], "r2"]))

    coefs_boot = pd.DataFrame(coefs_boot)
    coefs_stat = coefs_boot.describe(percentiles=[.975, .5, .025])
    print("Coefficients distribution")
    print(coefs_stat)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    r-squared: Mean=0.59, SE=0.09, CI=(0.40 0.73)
    Coefficients distribution
                0       1         2       3       4
    count  100.00  100.00  1.00e+02  100.00  100.00
    mean     1.02    1.05  2.12e-01   -0.02   -0.05
    std      0.09    0.11  9.75e-02    0.10    0.11
    min      0.63    0.82 -2.69e-03   -0.23   -0.27
    2.5%     0.86    0.88  3.27e-02   -0.20   -0.23
    50%      1.03    1.04  2.17e-01   -0.01   -0.06
    97.5%    1.17    1.29  3.93e-01    0.15    0.14
    max      1.20    1.45  4.33e-01    0.22    0.29

.. GENERATED FROM PYTHON SOURCE LINES 541-542

Plot the coefficient distributions

.. GENERATED FROM PYTHON SOURCE LINES 542-549

.. code-block:: default

    df = pd.DataFrame(coefs_boot)
    staked = pd.melt(df, var_name="Variable", value_name="Coef. distribution")
    sns.set_theme(style="whitegrid")
    ax = sns.violinplot(x="Variable", y="Coef. distribution", data=staked)
    _ = ax.axhline(0, ls='--', lw=2, color="black")

.. image:: /auto_gallery/images/sphx_glr_ml_resampling_002.png
   :alt: ml resampling
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 550-554

Parallel computation with joblib
--------------------------------

Dataset

.. GENERATED FROM PYTHON SOURCE LINES 554-564

.. code-block:: default

    import numpy as np
    from sklearn import datasets
    import sklearn.linear_model as lm
    import sklearn.metrics as metrics
    from sklearn.model_selection import StratifiedKFold

    X, y = datasets.make_classification(n_samples=20, n_features=5,
                                        n_informative=2, random_state=42)
    cv = StratifiedKFold(n_splits=5)

.. GENERATED FROM PYTHON SOURCE LINES 565-566

Use the `cross_validate` function

.. GENERATED FROM PYTHON SOURCE LINES 566-574

.. code-block:: default

    from sklearn.model_selection import cross_validate

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')
    cv_results = cross_validate(estimator, X, y, cv=cv, n_jobs=5)
    print(np.mean(cv_results['test_score']), cv_results['test_score'])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5 0.5 1.  1.  1. ]

.. GENERATED FROM PYTHON SOURCE LINES 575-578

Sequential computation

If we want full control of the operations performed within each fold (to
retrieve the models' parameters, etc.), we would like to parallelize the
following sequential code:

.. GENERATED FROM PYTHON SOURCE LINES 578-601

.. code-block:: default

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')
    y_test_pred_seq = np.zeros(len(y))  # Store predictions in the original order
    coefs_seq = list()
    for train, test in cv.split(X, y):
        X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
        estimator.fit(X_train, y_train)
        y_test_pred_seq[test] = estimator.predict(X_test)
        coefs_seq.append(estimator.coef_)

    test_accs = [metrics.accuracy_score(y[test], y_test_pred_seq[test])
                 for train, test in cv.split(X, y)]

    print(np.mean(test_accs), test_accs)
    coefs_cv = np.array(coefs_seq)
    print(coefs_cv)

    print(coefs_cv.mean(axis=0))

    print("Std Err of the coef")
    print(coefs_cv.std(axis=0) / np.sqrt(coefs_cv.shape[0]))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5, 0.5, 1.0, 1.0, 1.0]
    [[[-0.88  0.63  1.19 -0.31 -0.38]]

     [[-0.75  0.62  1.1   0.2  -0.4 ]]

     [[-0.96  0.51  1.12  0.08 -0.26]]

     [[-0.86  0.52  1.07 -0.11 -0.29]]

     [[-0.9   0.51  1.09 -0.25 -0.28]]]
    [[-0.87  0.56  1.11 -0.08 -0.32]]
    Std Err of the coef
    [[0.03 0.02 0.02 0.09 0.03]]

.. GENERATED FROM PYTHON SOURCE LINES 602-604

Parallel computation with joblib
--------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 604-633

.. code-block:: default

    from joblib import Parallel, delayed
    from sklearn.base import is_classifier, clone

    def _split_fit_predict(estimator, X, y, train, test):
        X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
        estimator.fit(X_train, y_train)
        return [estimator.predict(X_test), estimator.coef_]

    estimator = lm.LogisticRegression(C=1, solver='lbfgs')

    parallel = Parallel(n_jobs=5)
    cv_ret = parallel(
        delayed(_split_fit_predict)(
            clone(estimator), X, y, train, test) for train, test in cv.split(X, y))

    y_test_pred_cv, coefs_cv = zip(*cv_ret)

    # Retrieve predictions in the original order
    y_test_pred = np.zeros(len(y))
    for i, (train, test) in enumerate(cv.split(X, y)):
        y_test_pred[test] = y_test_pred_cv[i]

    test_accs = [metrics.accuracy_score(y[test], y_test_pred[test])
                 for train, test in cv.split(X, y)]
    print(np.mean(test_accs), test_accs)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    0.8 [0.5, 0.5, 1.0, 1.0, 1.0]

.. GENERATED FROM PYTHON SOURCE LINES 634-635

Test that we obtain the same predictions and the same coefficients

.. GENERATED FROM PYTHON SOURCE LINES 635-639

.. code-block:: default

    assert np.all(y_test_pred == y_test_pred_seq)
    assert np.allclose(np.array(coefs_cv).squeeze(), np.array(coefs_seq).squeeze())

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 4.727 seconds)