Linear dimension reduction and feature extraction
=================================================
Introduction
------------
In machine learning and statistics, dimensionality reduction or
dimension reduction is the process of reducing the number of features
under consideration, and can be divided into feature selection (not
addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and
builds derived values (features) intended to be informative and
non-redundant, facilitating the subsequent learning and generalization
steps, and in some cases leading to better human interpretations.
Feature extraction is related to dimensionality reduction.
The input matrix :math:`\mathbf{X}`, of dimension :math:`N \times P`, is
.. math::
\begin{bmatrix}
x_{11} & \ldots & x_{1P}\\
& & \\
\vdots & \mathbf{X} & \vdots\\
& & \\
x_{N1} & \ldots & x_{NP}
\end{bmatrix}
where the rows represent the samples and columns represent the
variables. The goal is to learn a transformation that extracts a few
relevant features.
Models:
1. Linear matrix decomposition/factorisation SVD/PCA. Those models
exploit the covariance :math:`\mathbf{\Sigma_{XX}}` between the input
features.
2. Non-linear models based on manifold learning: Isomap, t-SNE. Those
models
Singular value decomposition and matrix factorization
-----------------------------------------------------
Matrix factorization principles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Decompose the data matrix :math:`\mathbf{X}_{N \times P}` into a product
of a mixing matrix :math:`\mathbf{U}_{N \times K}` and a dictionary
matrix :math:`\mathbf{V}_{P \times K}`.
.. math::
\mathbf{X} = \mathbf{U} \mathbf{V}^{T},
If we consider only a subset of components
:math:`K`__
- `Principal Component Analysis in 3 Simple
Steps `__
Principles
~~~~~~~~~~
- Principal components analysis is the main method used for linear
dimension reduction.
- The idea of principal component analysis is to find the :math:`K`
**principal components directions** (called the **loadings**)
:math:`\mathbf{V}_{K\times P}` that capture the variation in the data
as much as possible.
- It converts a set of :math:`N` :math:`P`-dimensional observations
:math:`\mathbf{N}_{N\times P}` of possibly correlated variables into
a set of :math:`N` :math:`K`-dimensional samples
:math:`\mathbf{C}_{N\times K}`, where the :math:`K < P`. The new
variables are linearly uncorrelated. The columns of
:math:`\mathbf{C}_{N\times K}` are called the **principal
components**.
- The dimension reduction is obtained by using only :math:`K < P`
components that exploit correlation (covariance) among the original
variables.
- PCA is mathematically defined as an orthogonal linear transformation
:math:`\mathbf{V}_{K\times P}` that transforms the data to a new
coordinate system such that the greatest variance by some projection
of the data comes to lie on the first coordinate (called the first
principal component), the second greatest variance on the second
coordinate, and so on.
.. math::
\mathbf{C}_{N\times K} = \mathbf{X}_{N \times P} \mathbf{V}_{P \times K}
- PCA can be thought of as fitting a :math:`P`-dimensional ellipsoid to
the data, where each axis of the ellipsoid represents a principal
component. If some axis of the ellipse is small, then the variance
along that axis is also small, and by omitting that axis and its
corresponding principal component from our representation of the
dataset, we lose only a commensurately small amount of information.
- Finding the :math:`K` largest axes of the ellipse will permit to
project the data onto a space having dimensionality :math:`K < P`
while maximizing the variance of the projected data.
Dataset preprocessing
~~~~~~~~~~~~~~~~~~~~~
Centering
^^^^^^^^^
Consider a data matrix, :math:`\mathbf{X}` , with column-wise zero
empirical mean (the sample mean of each column has been shifted to
zero), ie. :math:`\mathbf{X}` is replaced by
:math:`\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^T`.
Standardizing
^^^^^^^^^^^^^
Optionally, standardize the columns, i.e., scale them by their
standard-deviation. Without standardization, a variable with a high
variance will capture most of the effect of the PCA. The principal
direction will be aligned with this variable. Standardization will,
however, raise noise variables to the save level as informative
variables.
The covariance matrix of centered standardized data is the correlation
matrix.
Eigendecomposition of the data covariance matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To begin with, consider the projection onto a one-dimensional space
(:math:`K = 1`). We can define the direction of this space using a
:math:`P`-dimensional vector :math:`\mathbf{v}`, which for convenience
(and without loss of generality) we shall choose to be a unit vector so
that :math:`\|\mathbf{v}\|_2 = 1` (note that we are only interested in
the direction defined by :math:`\mathbf{v}`, not in the magnitude of
:math:`\mathbf{v}` itself). PCA consists of two mains steps:
**Projection in the directions that capture the greatest variance**
Each :math:`P`-dimensional data point :math:`\mathbf{x}_i` is then
projected onto :math:`\mathbf{v}`, where the coordinate (in the
coordinate system of :math:`\mathbf{v}`) is a scalar value, namely
:math:`\mathbf{x}_i^T \mathbf{v}`. I.e., we want to find the vector
:math:`\mathbf{v}` that maximizes these coordinates along
:math:`\mathbf{v}`, which we will see corresponds to maximizing the
variance of the projected data. This is equivalently expressed as
.. math::
\mathbf{v} = \arg \max_{\|\mathbf{v}\|=1}\frac{1}{N}\sum_i \left(\mathbf{x}_i^T \mathbf{v}\right)^2.
We can write this in matrix form as
.. math::
\mathbf{v} = \arg \max_{\|\mathbf{v}\|=1} \frac{1}{N} \|\mathbf{X} \mathbf{v}\|^2 = \frac{1}{N} \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v} = \mathbf{v}^T\mathbf{S_{XX}}\mathbf{v},
where :math:`\mathbf{S_{XX}}` is a biased estiamte of the covariance
matrix of the data, i.e.
.. math::
\mathbf{S_{XX}} = \frac{1}{N} \mathbf{X}^T\mathbf{X}.
We now maximize the projected variance
:math:`\mathbf{v}^T \mathbf{S_{XX}} \mathbf{v}` with respect to
:math:`\mathbf{v}`. Clearly, this has to be a constrained maximization
to prevent :math:`\|\mathbf{v}_2\| \rightarrow \infty`. The appropriate
constraint comes from the normalization condition
:math:`\|\mathbf{v}\|_2 \equiv \|\mathbf{v}\|_2^2 = \mathbf{v}^T \mathbf{v} = 1`.
To enforce this constraint, we introduce a `Lagrange
multiplier `__
that we shall denote by :math:`\lambda`, and then make an unconstrained
maximization of
.. math::
\mathbf{v}^T\mathbf{S_{XX}} \mathbf{v} - \lambda (\mathbf{v}^T \mathbf{v} - 1).
By setting the gradient with respect to :math:`\mathbf{v}` equal to
zero, we see that this quantity has a stationary point when
.. math::
\mathbf{S_{XX}} \mathbf{v} = \lambda \mathbf{v}.
We note that :math:`\mathbf{v}` is an eigenvector of
:math:`\mathbf{S_{XX}}`.
If we left-multiply the above equation by :math:`\mathbf{v}^T` and make
use of :math:`\mathbf{v}^T \mathbf{v} = 1`, we see that the variance is
given by
.. math::
\mathbf{v}^T \mathbf{S_{XX}} \mathbf{v} = \lambda,
and so the variance will be at a maximum when :math:`\mathbf{v}` is
equal to the eigenvector corresponding to the largest eigenvalue,
:math:`\lambda`. This eigenvector is known as the first principal
component.
We can define additional principal components in an incremental fashion
by choosing each new direction to be that which maximizes the projected
variance amongst all possible directions that are orthogonal to those
already considered. If we consider the general case of a
:math:`K`-dimensional projection space, the optimal linear projection
for which the variance of the projected data is maximized is now defined
by the :math:`K` eigenvectors,
:math:`\mathbf{v_1}, \ldots , \mathbf{v_K}`, of the data covariance
matrix :math:`\mathbf{S_{XX}}` that corresponds to the :math:`K` largest
eigenvalues,
:math:`\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K`.
Back to SVD
^^^^^^^^^^^
The sample covariance matrix of **centered data** :math:`\mathbf{X}` is
given by
.. math::
\mathbf{S_{XX}} = \frac{1}{N-1}\mathbf{X}^T\mathbf{X}.
We rewrite :math:`\mathbf{X}^T\mathbf{X}` using the SVD decomposition of
:math:`\mathbf{X}` as
.. raw:: latex
\begin{align*}
\mathbf{X}^T\mathbf{X}
&= (\mathbf{U}\mathbf{D}\mathbf{V}^T)^T(\mathbf{U}\mathbf{D}\mathbf{V}^T)\\
&= \mathbf{V}\mathbf{D}^T\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T\\
&=\mathbf{V}\mathbf{D}^2\mathbf{V}^T\\
\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V} &= \mathbf{D}^2\\
\frac{1}{N-1} \mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V} &= \frac{1}{N-1}\mathbf{D}^2\\
\mathbf{V}^T\mathbf{S_{XX}}\mathbf{V} &= \frac{1}{N-1}\mathbf{D}^2
\end{align*}.
Considering only the :math:`k^{th}` right-singular vectors
:math:`\mathbf{v}_k` associated to the singular value :math:`d_k`
.. math::
\mathbf{v_k}^T\mathbf{S_{XX}}\mathbf{v_k} = \frac{1}{N-1}d_k^2,
It turns out that if you have done the singular value decomposition then
you already have the Eigenvalue decomposition for
:math:`\mathbf{X}^T\mathbf{X}`. Where - The eigenvectors of
:math:`\mathbf{S_{XX}}` are equivalent to the right singular vectors,
:math:`\mathbf{V}`, of :math:`\mathbf{X}`. - The eigenvalues,
:math:`\lambda_k`, of :math:`\mathbf{S_{XX}}`, i.e. the variances of the
components, are equal to :math:`\frac{1}{N-1}` times the squared
singular values, :math:`d_k`.
Moreover computing PCA with SVD do not require to form the matrix
:math:`\mathbf{X}^T\mathbf{X}`, so computing the SVD is now the standard
way to calculate a principal components analysis from a data matrix,
unless only a handful of components are required.
PCA outputs
^^^^^^^^^^^
The SVD or the eigendecomposition of the data covariance matrix provides
three main quantities:
1. **Principal component directions** or **loadings** are the
**eigenvectors** of :math:`\mathbf{X}^T\mathbf{X}`. The
:math:`\mathbf{V}_{K \times P}` or the **right-singular vectors** of
an SVD of :math:`\mathbf{X}` are called principal component
directions of :math:`\mathbf{X}`. They are generally computed using
the SVD of :math:`\mathbf{X}`.
2. **Principal components** is the :math:`{N\times K}` matrix
:math:`\mathbf{C}` which is obtained by projecting :math:`\mathbf{X}`
onto the principal components directions, i.e.
.. math::
\mathbf{C}_{N\times K} = \mathbf{X}_{N \times P} \mathbf{V}_{P \times K}.
Since :math:`\mathbf{X} = \mathbf{UDV}^T` and :math:`\mathbf{V}` is
orthogonal (:math:`\mathbf{V}^T \mathbf{V} = \mathbf{I}`):
.. raw:: latex
\begin{align}
\mathbf{C}_{N\times K} &= \mathbf{UDV}^T_{N \times P} \mathbf{V}_{P \times K}\\
\mathbf{C}_{N\times K} &= \mathbf{UD}^T_{N \times K} \mathbf{I}_{K \times K}\\
\mathbf{C}_{N\times K} &= \mathbf{UD}^T_{N \times K}\\
\end{align}
Thus :math:`\mathbf{c}_j = \mathbf{X}\mathbf{v}_j = \mathbf{u}_j d_j`,
for :math:`j=1, \ldots K`. Hence :math:`\mathbf{u}_j` is simply the
projection of the row vectors of :math:`\mathbf{X}`, i.e., the input
predictor vectors, on the direction :math:`\mathbf{v}_j`, scaled by
:math:`d_j`.
.. math::
\mathbf{c}_1=
\begin{bmatrix}
x_{1,1}v_{1,1}+ \ldots +x_{1,P}v_{1,P}\\
x_{2,1}v_{1,1}+ \ldots +x_{2,P}v_{1,P}\\
\vdots\\
x_{N,1}v_{1,1}+ \ldots +x_{N,P}v_{1,P}
\end{bmatrix}
3. The **variance** of each component is given by the eigen values
:math:`\lambda_k, k=1, \dots K`. It can be obtained from the singular
values:
.. raw:: latex
\begin{align}
var(\mathbf{c}_k) =& \frac{1}{N-1}(\mathbf{X} \mathbf{v}_k)^2\\
=& \frac{1}{N-1}(\mathbf{u}_k d_k)^2\\
=& \frac{1}{N-1}d_k^2
\end{align}
Determining the number of PCs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We must choose :math:`K^* \in [1, \ldots, K]`, the number of required
components. This can be done by calculating the explained variance ratio
of the :math:`K^*` first components and by choosing :math:`K^*` such
that the **cumulative explained variance** ratio is greater than some
given threshold (e.g., :math:`\approx 90\%`). This is expressed as
.. math::
\mathrm{cumulative~explained~variance}(\mathbf{c}_k) = \frac{\sum_j^{K^*} var(\mathbf{c}_k)}{\sum_j^K var(\mathbf{c}_k)}.
Interpretation and visualization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**PCs**
Plot the samples projeted on first the principal components as e.g. PC1
against PC2.
**PC directions**
Exploring the loadings associated with a component provides the
contribution of each original variable in the component.
Remark: The loadings (PC directions) are the coefficients of multiple
regression of PC on original variables:
.. raw:: latex
\begin{align}
\mathbf{c} & = \mathbf{X} \mathbf{v}\\
\mathbf{X}^T \mathbf{c} & = \mathbf{X}^T \mathbf{X} \mathbf{v}\\
(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{c} & = \mathbf{v}
\end{align}
Another way to evaluate the contribution of the original variables in
each PC can be obtained by computing the correlation between the PCs and
the original variables, i.e. columns of :math:`\mathbf{X}`, denoted
:math:`\mathbf{x}_j`, for :math:`j=1, \ldots, P`. For the :math:`k^{th}`
PC, compute and plot the correlations with all original variables
.. math::
cor(\mathbf{c}_k, \mathbf{x}_j), j=1 \ldots K, j=1 \ldots K.
These quantities are sometimes called the *correlation loadings*.
.. code:: ipython3
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
# PCA with scikit-learn
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
PC = pca.transform(X)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1"); plt.ylabel("x2")
plt.subplot(122)
plt.scatter(PC[:, 0], PC[:, 1])
plt.xlabel("PC1 (var=%.2f)" % pca.explained_variance_ratio_[0])
plt.ylabel("PC2 (var=%.2f)" % pca.explained_variance_ratio_[1])
plt.axis('equal')
plt.tight_layout()
.. parsed-literal::
[0.93646607 0.06353393]
.. image:: decomposition_files/decomposition_4_1.png
.. code:: ipython3
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection, neighbors)
print(__doc__)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
.. parsed-literal::
Automatically created module for IPython interactive environment
Eigen faces
-----------
Sources: `Scikit learn Faces
decompositions `__
Load data
.. code:: ipython3
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn import decomposition
n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)
faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True,
random_state=1)
n_samples, n_features = faces.shape
# Utils function
def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
plt.figure(figsize=(2. * n_col, 2.26 * n_row))
plt.suptitle(title, size=16)
for i, comp in enumerate(images):
plt.subplot(n_row, n_col, i + 1)
vmax = max(comp.max(), -comp.min())
plt.imshow(comp.reshape(image_shape), cmap=cmap,
interpolation='nearest',
vmin=-vmax, vmax=vmax)
plt.xticks(())
plt.yticks(())
plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
Preprocessing
.. code:: ipython3
# global centering
faces_centered = faces - faces.mean(axis=0)
# local centering
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)
First centered Olivetti faces
.. code:: ipython3
plot_gallery("First centered Olivetti faces", faces_centered[:n_components])
.. image:: decomposition_files/decomposition_11_0.png
.. code:: ipython3
pca = decomposition.PCA(n_components=n_components)
pca.fit(faces_centered)
plot_gallery("PCA first %i loadings" % n_components, pca.components_[:n_components])
.. image:: decomposition_files/decomposition_12_0.png
Exercises
---------
Write a basic PCA class
~~~~~~~~~~~~~~~~~~~~~~~
Write a class ``BasicPCA`` with two methods:
- ``fit(X)`` that estimates the data mean, principal components
directions :math:`\textbf{V}` and the explained variance of each
component.
- ``transform(X)`` that projects the data onto the principal
components.
Check that your ``BasicPCA`` gave similar results, compared to the
results from ``sklearn``.
Apply your Basic PCA on the ``iris`` dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The data set is available at:
https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv
- Describe the data set. Should the dataset been standardized?
- Describe the structure of correlations among variables.
- Compute a PCA with the maximum number of components.
- Compute the cumulative explained variance ratio. Determine the number
of components :math:`K` by your computed values.
- Print the :math:`K` principal components directions and correlations
of the :math:`K` principal components with the original variables.
Interpret the contribution of the original variables into the PC.
- Plot the samples projected into the :math:`K` first PCs.
- Color samples by their species.
Run scikit-learn examples
~~~~~~~~~~~~~~~~~~~~~~~~~
Load the notebook or python file at the end of each examples
- `Faces dataset
decompositions `__
- `Faces recognition example using eigenfaces and
SVMs `__