.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_gallery/stat_univ_lab_brain-volume.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_gallery_stat_univ_lab_brain-volume.py: Lab: Brain volumes study ======================= The study provides the brain volumes of grey matter (gm), white matter (wm) and cerebrospinal fluid) (csf) of 808 anatomical MRI scans. .. GENERATED FROM PYTHON SOURCE LINES 10-12 Manipulate data --------------- .. GENERATED FROM PYTHON SOURCE LINES 14-18 Set the working directory within a directory called "brainvol" Create 2 subdirectories: `data` that will contain downloaded data and `reports` for results of the analysis. .. GENERATED FROM PYTHON SOURCE LINES 18-34 .. code-block:: default import os import os.path import pandas as pd import tempfile import urllib.request WD = os.path.join(tempfile.gettempdir(), "brainvol") os.makedirs(WD, exist_ok=True) #os.chdir(WD) # use cookiecutter file organization # https://drivendata.github.io/cookiecutter-data-science/ os.makedirs(os.path.join(WD, "data"), exist_ok=True) #os.makedirs("reports", exist_ok=True) .. GENERATED FROM PYTHON SOURCE LINES 35-43 **Fetch data** - Demographic data `demo.csv` (columns: `participant_id`, `site`, `group`, `age`, `sex`) and tissue volume data: `group` is Control or Patient. `site` is the recruiting site. - Gray matter volume `gm.csv` (columns: `participant_id`, `session`, `gm_vol`) - White matter volume `wm.csv` (columns: `participant_id`, `session`, `wm_vol`) - Cerebrospinal Fluid `csf.csv` (columns: `participant_id`, `session`, `csf_vol`) .. GENERATED FROM PYTHON SOURCE LINES 43-61 .. code-block:: default base_url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/brain_volumes/%s' data = dict() for file in ["demo.csv", "gm.csv", "wm.csv", "csf.csv"]: urllib.request.urlretrieve(base_url % file, os.path.join(WD, "data", file)) # Read all CSV in one line # dicts = {k: pd.read_csv(os.path.join(WD, "data", "%s.csv" % k)) # for k in ["demo", "gm", "wm", "csf"]} demo = pd.read_csv(os.path.join(WD, "data", "demo.csv")) gm = pd.read_csv(os.path.join(WD, "data", "gm.csv")) wm = pd.read_csv(os.path.join(WD, "data", "wm.csv")) csf = pd.read_csv(os.path.join(WD, "data", "csf.csv")) print("tables can be merge using shared columns") print(gm.head()) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none tables can be merge using shared columns participant_id session gm_vol 0 sub-S1-0002 ses-01 0.67 1 sub-S1-0002 ses-02 0.68 2 sub-S1-0002 ses-03 0.67 3 sub-S1-0004 ses-01 0.89 4 sub-S1-0004 ses-02 0.88 .. GENERATED FROM PYTHON SOURCE LINES 62-63 **Merge tables** according to `participant_id` .. GENERATED FROM PYTHON SOURCE LINES 63-67 .. code-block:: default brain_vol = pd.merge(pd.merge(pd.merge(demo, gm), wm), csf) assert brain_vol.shape == (808, 9) .. GENERATED FROM PYTHON SOURCE LINES 68-69 **Drop rows with missing values** .. GENERATED FROM PYTHON SOURCE LINES 69-73 .. code-block:: default brain_vol = brain_vol.dropna() assert brain_vol.shape == (766, 9) .. GENERATED FROM PYTHON SOURCE LINES 74-76 **Compute Total Intra-cranial volume** `tiv_vol` = `gm_vol` + `csf_vol` + `wm_vol`. .. GENERATED FROM PYTHON SOURCE LINES 76-79 .. code-block:: default brain_vol["tiv_vol"] = brain_vol["gm_vol"] + brain_vol["wm_vol"] + brain_vol["csf_vol"] .. GENERATED FROM PYTHON SOURCE LINES 80-82 **Compute tissue fractions** `gm_f = gm_vol / tiv_vol`, `wm_f = wm_vol / tiv_vol`. .. GENERATED FROM PYTHON SOURCE LINES 82-86 .. code-block:: default brain_vol["gm_f"] = brain_vol["gm_vol"] / brain_vol["tiv_vol"] brain_vol["wm_f"] = brain_vol["wm_vol"] / brain_vol["tiv_vol"] .. GENERATED FROM PYTHON SOURCE LINES 87-88 **Save in a excel file** `brain_vol.xlsx` .. GENERATED FROM PYTHON SOURCE LINES 88-92 .. code-block:: default brain_vol.to_excel(os.path.join(WD, "data", "brain_vol.xlsx"), sheet_name='data', index=False) .. GENERATED FROM PYTHON SOURCE LINES 93-95 Descriptive Statistics ---------------------- .. GENERATED FROM PYTHON SOURCE LINES 97-98 Load excel file `brain_vol.xlsx` .. GENERATED FROM PYTHON SOURCE LINES 98-111 .. code-block:: default import os import pandas as pd import seaborn as sns import statsmodels.formula.api as smfrmla import statsmodels.api as sm brain_vol = pd.read_excel(os.path.join(WD, "data", "brain_vol.xlsx"), sheet_name='data') # Round float at 2 decimals when printing pd.options.display.float_format = '{:,.2f}'.format .. GENERATED FROM PYTHON SOURCE LINES 112-115 **Descriptive statistics** Most of participants have several MRI sessions (column `session`) Select on rows from session one "ses-01" .. GENERATED FROM PYTHON SOURCE LINES 115-121 .. code-block:: default brain_vol1 = brain_vol[brain_vol.session == "ses-01"] # Check that there are no duplicates assert len(brain_vol1.participant_id.unique()) == len(brain_vol1.participant_id) .. GENERATED FROM PYTHON SOURCE LINES 122-123 Global descriptives statistics of numerical variables .. GENERATED FROM PYTHON SOURCE LINES 123-128 .. code-block:: default desc_glob_num = brain_vol1.describe() print(desc_glob_num) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none age gm_vol wm_vol csf_vol tiv_vol gm_f wm_f count 244.00 244.00 244.00 244.00 244.00 244.00 244.00 mean 34.54 0.71 0.44 0.31 1.46 0.49 0.30 std 12.09 0.08 0.07 0.08 0.17 0.04 0.03 min 18.00 0.48 0.05 0.12 0.83 0.37 0.06 25% 25.00 0.66 0.40 0.25 1.34 0.46 0.28 50% 31.00 0.70 0.43 0.30 1.45 0.49 0.30 75% 44.00 0.77 0.48 0.37 1.57 0.52 0.31 max 61.00 1.03 0.62 0.63 2.06 0.60 0.36 .. GENERATED FROM PYTHON SOURCE LINES 129-130 Global Descriptive statistics of categorical variable .. GENERATED FROM PYTHON SOURCE LINES 130-140 .. code-block:: default desc_glob_cat = brain_vol1[["site", "group", "sex"]].describe(include='all') print(desc_glob_cat) print("Get count by level") desc_glob_cat = pd.DataFrame({col: brain_vol1[col].value_counts().to_dict() for col in ["site", "group", "sex"]}) print(desc_glob_cat) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none site group sex count 244 244 244 unique 7 2 2 top S7 Patient M freq 65 157 155 Get count by level site group sex S7 65.00 nan nan S5 62.00 nan nan S8 59.00 nan nan S3 29.00 nan nan S4 15.00 nan nan S1 13.00 nan nan S6 1.00 nan nan Patient nan 157.00 nan Control nan 87.00 nan M nan nan 155.00 F nan nan 89.00 .. GENERATED FROM PYTHON SOURCE LINES 141-142 Remove the single participant from site 6 .. GENERATED FROM PYTHON SOURCE LINES 142-150 .. code-block:: default brain_vol = brain_vol[brain_vol.site != "S6"] brain_vol1 = brain_vol[brain_vol.session == "ses-01"] desc_glob_cat = pd.DataFrame({col: brain_vol1[col].value_counts().to_dict() for col in ["site", "group", "sex"]}) print(desc_glob_cat) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none site group sex S7 65.00 nan nan S5 62.00 nan nan S8 59.00 nan nan S3 29.00 nan nan S4 15.00 nan nan S1 13.00 nan nan Patient nan 157.00 nan Control nan 86.00 nan M nan nan 155.00 F nan nan 88.00 .. GENERATED FROM PYTHON SOURCE LINES 151-152 Descriptives statistics of numerical variables per clinical status .. GENERATED FROM PYTHON SOURCE LINES 152-156 .. code-block:: default desc_group_num = brain_vol1[["group", 'gm_vol']].groupby("group").describe() print(desc_group_num) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none gm_vol count mean std min 25% 50% 75% max group Control 86.00 0.72 0.09 0.48 0.66 0.71 0.78 1.03 Patient 157.00 0.70 0.08 0.53 0.65 0.70 0.76 0.90 .. GENERATED FROM PYTHON SOURCE LINES 157-169 Statistics ---------- Objectives: 1. Site effect of gray matter atrophy 2. Test the association between the age and gray matter atrophy in the control and patient population independently. 3. Test for differences of atrophy between the patients and the controls 4. Test for interaction between age and clinical status, ie: is the brain atrophy process in patient population faster than in the control population. 5. The effect of the medication in the patient population. .. GENERATED FROM PYTHON SOURCE LINES 169-175 .. code-block:: default import statsmodels.api as sm import statsmodels.formula.api as smfrmla import scipy.stats import seaborn as sns .. GENERATED FROM PYTHON SOURCE LINES 176-187 **1 Site effect on Grey Matter atrophy** The model is Oneway Anova gm_f ~ site The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid. - The samples are independent. - Each sample is from a normally distributed population. - The population standard deviations of the groups are all equal. This property is known as homoscedasticity. .. GENERATED FROM PYTHON SOURCE LINES 189-190 Plot .. GENERATED FROM PYTHON SOURCE LINES 190-193 .. code-block:: default sns.violinplot(x="site", y="gm_f", data=brain_vol1) # sns.violinplot(x="site", y="wm_f", data=brain_vol1) .. image:: /auto_gallery/images/sphx_glr_stat_univ_lab_brain-volume_001.png :alt: stat univ lab brain volume :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 194-195 Stats with scipy .. GENERATED FROM PYTHON SOURCE LINES 195-200 .. code-block:: default fstat, pval = scipy.stats.f_oneway(*[brain_vol1.gm_f[brain_vol1.site == s] for s in brain_vol1.site.unique()]) print("Oneway Anova gm_f ~ site F=%.2f, p-value=%E" % (fstat, pval)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Oneway Anova gm_f ~ site F=14.82, p-value=1.188136E-12 .. GENERATED FROM PYTHON SOURCE LINES 201-202 Stats with statsmodels .. GENERATED FROM PYTHON SOURCE LINES 202-210 .. code-block:: default anova = smfrmla.ols("gm_f ~ site", data=brain_vol1).fit() # print(anova.summary()) print("Site explains %.2f%% of the grey matter fraction variance" % (anova.rsquared * 100)) print(sm.stats.anova_lm(anova, typ=2)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Site explains 23.82% of the grey matter fraction variance sum_sq df F PR(>F) site 0.11 5.00 14.82 0.00 Residual 0.35 237.00 nan nan .. GENERATED FROM PYTHON SOURCE LINES 211-213 **2. Test the association between the age and gray matter atrophy** in the control and patient population independently. .. GENERATED FROM PYTHON SOURCE LINES 215-216 Plot .. GENERATED FROM PYTHON SOURCE LINES 216-222 .. code-block:: default sns.lmplot(x="age", y="gm_f", hue="group", data=brain_vol1) brain_vol1_ctl = brain_vol1[brain_vol1.group == "Control"] brain_vol1_pat = brain_vol1[brain_vol1.group == "Patient"] .. image:: /auto_gallery/images/sphx_glr_stat_univ_lab_brain-volume_002.png :alt: stat univ lab brain volume :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 223-224 Stats with scipy .. GENERATED FROM PYTHON SOURCE LINES 224-243 .. code-block:: default print("--- In control population ---") beta, beta0, r_value, p_value, std_err = \ scipy.stats.linregress(x=brain_vol1_ctl.age, y=brain_vol1_ctl.gm_f) print("gm_f = %f * age + %f" % (beta, beta0)) print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"\ % (r_value, r_value**2, p_value, std_err)) print("--- In patient population ---") beta, beta0, r_value, p_value, std_err = \ scipy.stats.linregress(x=brain_vol1_pat.age, y=brain_vol1_pat.gm_f) print("gm_f = %f * age + %f" % (beta, beta0)) print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"\ % (r_value, r_value**2, p_value, std_err)) print("Decrease seems faster in patient than in control population") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none --- In control population --- gm_f = -0.001181 * age + 0.529829 Corr: -0.325122, r-squared: 0.105704, p-value: 0.002255, std_err: 0.000375 --- In patient population --- gm_f = -0.001899 * age + 0.556886 Corr: -0.528765, r-squared: 0.279592, p-value: 0.000000, std_err: 0.000245 Decrease seems faster in patient than in control population .. GENERATED FROM PYTHON SOURCE LINES 244-245 Stats with statsmodels .. GENERATED FROM PYTHON SOURCE LINES 245-258 .. code-block:: default print("--- In control population ---") lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_ctl).fit() print(lr.summary()) print("Age explains %.2f%% of the grey matter fraction variance" % (lr.rsquared * 100)) print("--- In patient population ---") lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_pat).fit() print(lr.summary()) print("Age explains %.2f%% of the grey matter fraction variance" % (lr.rsquared * 100)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none --- In control population --- OLS Regression Results =============================================================================== Dep. Variable: gm_f R-squared: 0.106 Model: OLS Adj. R-squared: 0.095 Method: Least Squares F-statistic: 9.929 Date: ven., 08 janv. 2021 Prob (F-statistic): 0.00226 Time: 16:12:34 Log-Likelihood: 159.34 No. Observations: 86 AIC: -314.7 Df Residuals: 84 BIC: -309.8 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.5298 0.013 40.350 0.000 0.504 0.556 age -0.0012 0.000 -3.151 0.002 -0.002 -0.000 ============================================================================== Omnibus: 0.946 Durbin-Watson: 1.628 Prob(Omnibus): 0.623 Jarque-Bera (JB): 0.782 Skew: 0.233 Prob(JB): 0.676 Kurtosis: 2.962 Cond. No. 111. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. Age explains 10.57% of the grey matter fraction variance --- In patient population --- OLS Regression Results =============================================================================== Dep. Variable: gm_f R-squared: 0.280 Model: OLS Adj. R-squared: 0.275 Method: Least Squares F-statistic: 60.16 Date: ven., 08 janv. 2021 Prob (F-statistic): 1.09e-12 Time: 16:12:34 Log-Likelihood: 289.38 No. Observations: 157 AIC: -574.8 Df Residuals: 155 BIC: -568.7 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.5569 0.009 60.817 0.000 0.539 0.575 age -0.0019 0.000 -7.756 0.000 -0.002 -0.001 ============================================================================== Omnibus: 2.310 Durbin-Watson: 1.325 Prob(Omnibus): 0.315 Jarque-Bera (JB): 1.854 Skew: 0.230 Prob(JB): 0.396 Kurtosis: 3.268 Cond. No. 111. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. Age explains 27.96% of the grey matter fraction variance .. GENERATED FROM PYTHON SOURCE LINES 259-262 Before testing for differences of atrophy between the patients ans the controls **Preliminary tests for age x group effect** (patients would be older or younger than Controls) .. GENERATED FROM PYTHON SOURCE LINES 264-265 Plot .. GENERATED FROM PYTHON SOURCE LINES 265-268 .. code-block:: default sns.violinplot(x="group", y="age", data=brain_vol1) .. image:: /auto_gallery/images/sphx_glr_stat_univ_lab_brain-volume_003.png :alt: stat univ lab brain volume :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 269-270 Stats with scipy .. GENERATED FROM PYTHON SOURCE LINES 270-273 .. code-block:: default print(scipy.stats.ttest_ind(brain_vol1_ctl.age, brain_vol1_pat.age)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Ttest_indResult(statistic=-1.2155557697674162, pvalue=0.225343592508479) .. GENERATED FROM PYTHON SOURCE LINES 274-275 Stats with statsmodels .. GENERATED FROM PYTHON SOURCE LINES 275-279 .. code-block:: default print(smfrmla.ols("age ~ group", data=brain_vol1).fit().summary()) print("No significant difference in age between patients and controls") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none OLS Regression Results =============================================================================== Dep. Variable: age R-squared: 0.006 Model: OLS Adj. R-squared: 0.002 Method: Least Squares F-statistic: 1.478 Date: ven., 08 janv. 2021 Prob (F-statistic): 0.225 Time: 16:12:34 Log-Likelihood: -949.69 No. Observations: 243 AIC: 1903. Df Residuals: 241 BIC: 1910. Df Model: 1 Covariance Type: nonrobust ==================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------ Intercept 33.2558 1.305 25.484 0.000 30.685 35.826 group[T.Patient] 1.9735 1.624 1.216 0.225 -1.225 5.172 ============================================================================== Omnibus: 35.711 Durbin-Watson: 2.096 Prob(Omnibus): 0.000 Jarque-Bera (JB): 20.726 Skew: 0.569 Prob(JB): 3.16e-05 Kurtosis: 2.133 Cond. No. 3.12 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. No significant difference in age between patients and controls .. GENERATED FROM PYTHON SOURCE LINES 280-282 **Preliminary tests for sex x group** (more/less males in patients than in Controls) .. GENERATED FROM PYTHON SOURCE LINES 282-292 .. code-block:: default crosstab = pd.crosstab(brain_vol1.sex, brain_vol1.group) print("Obeserved contingency table") print(crosstab) chi2, pval, dof, expected = scipy.stats.chi2_contingency(crosstab) print("Chi2 = %f, pval = %f" % (chi2, pval)) print("No significant difference in sex between patients and controls") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Obeserved contingency table group Control Patient sex F 33 55 M 53 102 Chi2 = 0.143253, pval = 0.705068 No significant difference in sex between patients and controls .. GENERATED FROM PYTHON SOURCE LINES 293-294 **3. Test for differences of atrophy between the patients and the controls** .. GENERATED FROM PYTHON SOURCE LINES 294-299 .. code-block:: default print(sm.stats.anova_lm(smfrmla.ols("gm_f ~ group", data=brain_vol1).fit(), typ=2)) print("No significant difference in atrophy between patients and controls") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none sum_sq df F PR(>F) group 0.00 1.00 0.01 0.92 Residual 0.46 241.00 nan nan No significant difference in atrophy between patients and controls .. GENERATED FROM PYTHON SOURCE LINES 300-301 This model is simplistic we should adjust for age and site .. GENERATED FROM PYTHON SOURCE LINES 301-306 .. code-block:: default print(sm.stats.anova_lm(smfrmla.ols( "gm_f ~ group + age + site", data=brain_vol1).fit(), typ=2)) print("No significant difference in GM between patients and controls") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none sum_sq df F PR(>F) group 0.00 1.00 1.82 0.18 site 0.11 5.00 19.79 0.00 age 0.09 1.00 86.86 0.00 Residual 0.25 235.00 nan nan No significant difference in GM between patients and controls .. GENERATED FROM PYTHON SOURCE LINES 307-308 Observe age effect .. GENERATED FROM PYTHON SOURCE LINES 310-312 **4. Test for interaction between age and clinical status**, ie: is the brain atrophy process in patient population faster than in the control population. .. GENERATED FROM PYTHON SOURCE LINES 312-324 .. code-block:: default ancova = smfrmla.ols("gm_f ~ group:age + age + site", data=brain_vol1).fit() print(sm.stats.anova_lm(ancova, typ=2)) print("= Parameters =") print(ancova.params) print("%.3f%% of grey matter loss per year (almost %.1f%% per decade)" % (ancova.params.age * 100, ancova.params.age * 100 * 10)) print("grey matter loss in patients is accelerated by %.3f%% per decade" % (ancova.params['group[T.Patient]:age'] * 100 * 10)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none sum_sq df F PR(>F) site 0.11 5.00 20.28 0.00 age 0.10 1.00 89.37 0.00 group:age 0.00 1.00 3.28 0.07 Residual 0.25 235.00 nan nan = Parameters = Intercept 0.52 site[T.S3] 0.01 site[T.S4] 0.03 site[T.S5] 0.01 site[T.S7] 0.06 site[T.S8] 0.02 age -0.00 group[T.Patient]:age -0.00 dtype: float64 -0.148% of grey matter loss per year (almost -1.5% per decade) grey matter loss in patients is accelerated by -0.232% per decade .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 3.007 seconds) .. _sphx_glr_download_auto_gallery_stat_univ_lab_brain-volume.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: stat_univ_lab_brain-volume.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: stat_univ_lab_brain-volume.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_