Statistics and Machine Learning in Python
Edouard Duchesnay, Tommy Löfstedt, Feki Younes
Introduction
This document describes statistics and machine learning in Python using the following libraries (a minimal import sketch is shown after this list):
- Scikit-learn for machine learning.
- PyTorch for deep learning.
- Statsmodels for statistics.
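As a minimal sketch (assuming the three packages are installed, e.g. with pip), the core libraries can be imported and their versions printed as follows:

    # Minimal sketch: import the three core libraries used throughout this document.
    import sklearn            # machine learning
    import torch              # deep learning
    import statsmodels        # statistics

    print(sklearn.__version__, torch.__version__, statsmodels.__version__)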
Python language
- Import libraries
- Basic operations
- Data types
- Execution control statements
- List comprehensions, iterators, etc.
- Functions
- Regular expression
- System programming
- Scripts and argument parsing
- Networking
- Modules and packages
- Object Oriented Programming (OOP)
- Style guide for Python programming
- Documenting
- Exercises
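As a small taste of the Python-language topics listed above (functions, list comprehensions, regular expressions), here is a minimal, self-contained sketch; the function name and example strings are illustrative choices, not taken from the course:

    import re

    def squares(n):
        """Return the squares of 0..n-1, built with a list comprehension."""
        return [i ** 2 for i in range(n)]

    print(squares(5))                    # [0, 1, 4, 9, 16]
    print(re.findall(r"\d+", "a1 b22"))  # ['1', '22']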
Scientific Python
Statistics
- Univariate statistics
- Libraries
- Estimators of the main statistical measures
- Main distributions
- Hypothesis Testing
- Testing pairwise associations
- Pearson correlation test: test association between two quantitative variables
- Two sample (Student) \(t\)-test: compare two means
- ANOVA \(F\)-test (quantitative ~ categorical (>= 2 levels))
- Chi-square, \(\chi^2\) (categorical ~ categorical)
- Non-parametric test of pairwise associations
- Linear model
- Linear model with statsmodels
- Multiple comparisons
- Lab: Brain volumes study
- Linear Mixed Models
- Multivariate statistics
- Time series in Python
- Stationarity
- Pandas time series data structure
- Time series analysis of Google Trends
- Read data
- Recode data
- Exploratory data analysis
- Resampling, smoothing, windowing, rolling average: trends
- First-order differencing: seasonal patterns
- Periodicity and correlation
- Autocorrelation
- Time series forecasting with Python using Autoregressive Moving Average (ARMA) models
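A minimal sketch of the pairwise association tests listed above (Pearson correlation and the two-sample Student \(t\)-test), using scipy.stats on simulated data; the variable names and the simulated dataset are illustrative assumptions, not taken from the course material:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    x = rng.normal(size=50)                 # quantitative variable
    y = x + rng.normal(scale=0.5, size=50)  # variable correlated with x

    r, p = stats.pearsonr(x, y)             # Pearson correlation test
    t, p_t = stats.ttest_ind(x, y)          # two-sample (Student) t-test
    print(f"r={r:.2f} (p={p:.3g}); t={t:.2f} (p={p_t:.3g})")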
Machine Learning
- Linear dimension reduction and feature extraction
- Manifold learning: non-linear dimension reduction
- Clustering
- Linear models for regression problems
- Ordinary least squares
- Linear regression with scikit-learn
- Overfitting
- Regularization using penalization of coefficients
- Ridge regression (\(\ell_2\)-regularization)
- Lasso regression (\(\ell_1\)-regularization)
- Elastic-net regression (\(\ell_1\)-\(\ell_2\)-regularization)
- Regression performance evaluation metrics: R-squared, MSE and MAE
- Linear models for classification problems
- Fisher’s linear discriminant with equal class covariance
- Linear discriminant analysis (LDA)
- Logistic regression
- Losses
- Overfitting
- Regularization using penalization of coefficients
- Ridge Fisher’s linear classification (\(\ell_2\)-regularization)
- Ridge logistic regression (\(\ell_2\)-regularization)
- Lasso logistic regression (\(\ell_1\)-regularization)
- Ridge linear Support Vector Machine (\(\ell_2\)-regularization)
- Lasso linear Support Vector Machine (\(\ell_1\)-regularization)
- Elastic-net classification (\(\ell_1\)-\(\ell_2\)-regularization)
- Classification performance evaluation metrics
- Imbalanced classes
- Confidence interval cross-validation
- Exercise
- Non-linear models
- Resampling methods
- Train, validation and test sets
- Split dataset in train/test sets for model evaluation
- Train/validation/test splits: model selection and model evaluation
- Cross-Validation (CV)
- Cross-validation for model selection
- Cross-validation for both model (outer) evaluation and model (inner) selection
- Models with built-in cross-validation
- Random Permutations: sample the null distribution
- Random permutations
- Bootstrapping
- Parallel computation with joblib
- Ensemble learning: bagging, boosting and stacking
- Gradient descent
- Lab: Faces recognition using various learning models
- Utils
- Download the data
- Split into training and testing sets in a stratified way
- Eigenfaces
- LogisticRegression with L2 penalty (with CV-based model selection)
- SVM (with CV-based model selection)
- MLP with sklearn and CV-based model selection
- MLP with PyTorch and no model selection
- Univariate feature filtering (ANOVA) with Logistic-L2
- PCA with LogisticRegression (L2 regularization)
- Basic ConvNet
- ConvNet with ResNet18
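To illustrate the scikit-learn workflow that recurs throughout the machine-learning chapters above (stratified train/test split, L2-regularized logistic regression, cross-validation), here is a minimal sketch; the Iris dataset and the parameter values are illustrative assumptions, not results from the labs:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_iris(return_X_y=True)

    # Stratified train/test split for model evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)

    # L2-regularized (ridge) logistic regression
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)

    print("Test accuracy:", clf.score(X_test, y_test))
    print("5-fold CV accuracy:",
          cross_val_score(clf, X_train, y_train, cv=5).mean())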