Principal Component Analysis, also referred to as PCA, is another very common unsupervised learning technique. PCA is a dimensionality reduction technique that is often used in practice for visualization, feature extraction, noise filtering, and more.
Generally, PCA is applied to data sets with many variables. PCA creates new variables that are linear combinations of the original variables. The idea is to reduce the dimensionality of the data considerably while retaining as much information as possible. While the purpose is a significant reduction, the maximum number of new variables that can possibly be created equals the number of original variables. A nice property of PCA is that the newly created variables are uncorrelated.
A simple example: imagine that a data set consists of the height and weight of a group of people. These two metrics are likely heavily correlated, so we could summarize them in a single variable, a linear combination of the two, that contains most of the information of both. It is important to note that the effectiveness of PCA strongly depends on the structure of the correlation matrix of the original variables!
In order to understand how PCA actually works, we first need to be comfortable with eigenvectors and eigenvalues.
An eigenvector is a vector that is left unchanged by a transformation, except for a scaling by a scalar value known as the eigenvalue.
Formally: if there exists a square matrix $A$, a non-zero vector $v$ and a scalar $\lambda$ such that $Av = \lambda v$, then $v$ is an eigenvector of $A$ and $\lambda$ is its corresponding eigenvalue.
Eigenvalues and eigenvectors are very useful and have tons of applications!
Imagine you have a matrix $A$:
\begin{equation} A = \begin{bmatrix} 3 & 2 \\ 3 & -2 \end{bmatrix} \end{equation}
We have an eigenvector \begin{equation} v = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \end{equation}
Let's perform the multiplication $Av$:
\begin{equation} Av = \begin{bmatrix} 3 & 2 \\ 3 & -2 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \cdot 2 + 2 \cdot 1 \\ 3 \cdot 2 + (-2) \cdot 1 \end{bmatrix} = \begin{bmatrix} 8 \\ 4 \end{bmatrix} \end{equation}
Now we want to see if we can find a scalar $\lambda$ such that
\begin{equation} Av = \begin{bmatrix} 8 \\ 4 \end{bmatrix} = \lambda \begin{bmatrix} 2 \\ 1 \end{bmatrix} \end{equation}
It turns out that $\lambda = 4$ does the job, so 4 is an eigenvalue of $A$ with eigenvector $v$.
An eigenvalue $\lambda$ of $A$ must satisfy the characteristic equation
$ det(A- \lambda I)= 0$
\begin{equation} det(A- \lambda I) = det\begin{bmatrix} 3-\lambda & 2 \\ 3 & -2-\lambda \end{bmatrix} = (3-\lambda)(-2-\lambda) - 2 \cdot 3 = \lambda^2 - \lambda - 12 = (\lambda - 4)(\lambda + 3) \end{equation}
Setting this determinant to zero, we indeed find that 4 is an eigenvalue, and so is -3! You'll learn about the connection between eigenvalues, eigenvectors, and PCA in a second.
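We can quickly verify this example numerically. The sketch below uses NumPy's `np.linalg.eig` to recover both eigenvalues of $A$ and checks that $Av = 4v$ for our eigenvector $v = (2, 1)$:

```python
import numpy as np

A = np.array([[3, 2],
              [3, -2]])

# np.linalg.eig returns the eigenvalues and the (unit-length) eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(sorted(eigenvalues))  # [-3.0, 4.0]

# verify Av = 4v for the eigenvector v = (2, 1) from the example above
v = np.array([2, 1])
print(A @ v)  # [8 4], i.e. 4 * v
```

Note that `eig` normalizes the eigenvectors to unit length, so the column belonging to $\lambda = 4$ is a scaled version of $(2, 1)$; any non-zero multiple of an eigenvector is still an eigenvector.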
https://www.youtube.com/watch?v=ue3yoeZvt8E
Let's say we have $p$ variables measured on $n$ observations, collected in an $n \times p$ data matrix:
\begin{equation} \begin{bmatrix} X_{11} & X_{12} & X_{13} & \dots & X_{1p} \\ X_{21} & X_{22} & X_{23} & \dots & X_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & X_{n3} & \dots & X_{np} \end{bmatrix} \end{equation}
For 2 variables, this is what our data could look like:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# generate 50 random observations for two variables X and Y
np.random.seed(123)
X = np.random.normal(2, 1.5, 50)
Y = np.random.normal(3, 0.6, 50)

# scatter plot with the x- and y-axis drawn through the origin
fig, ax = plt.subplots()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
plt.ylim(-5, 5)
plt.xlim(-7, 7)
plt.scatter(X, Y, s=6)
3.2 The mean & mean-corrected data
The mean of the $j$-th variable is
\begin{equation} \bar{X}_j = \dfrac{\sum_{i=1}^n X_{ij}}{n} \end{equation}
To get to the mean-corrected data: subtract the mean from each observation of the corresponding variable.
Going back to our two variables example, this is how the data would be shifted:
fig, ax = plt.subplots()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')

# subtract the mean so both variables are centered around 0
X_mean = X - np.mean(X)
Y_mean = Y - np.mean(Y)

plt.ylim(-5, 5)
plt.xlim(-7, 7)
plt.scatter(X_mean, Y_mean, s=6)
3.3 The variance & standardized data
To get to the standardized data: divide the mean-corrected data by the standard deviation of each variable, so that every variable has variance 1.
Going back to the example with 2 variables, this is what standardized data would look like:
fig, ax = plt.subplots()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')

# mean-correct, then divide by the standard deviation
X_mean = X - np.mean(X)
Y_mean = Y - np.mean(Y)
X_std = np.std(X)
Y_std = np.std(Y)
X_stdized = X_mean / X_std
Y_stdized = Y_mean / Y_std

plt.ylim(-5, 5)
plt.xlim(-7, 7)
plt.scatter(X_stdized, Y_stdized, s=6)
3.4 The covariance
The covariance for two variables $X_j$ and $X_k$ is given by
\begin{equation} s_{jk} = \dfrac{\sum_{i=1}^n (X_{ij}-\bar{X}_j)(X_{ik}-\bar{X}_k)}{n-1} \end{equation}
Denote the covariance matrix by $\mathbf{S}$; it collects the variances on the diagonal and the covariances off the diagonal:
\begin{equation} \mathbf{S} = \begin{bmatrix} s^2_{1} & s_{12} & \dots & s_{1p} \\ s_{21} & s^2_{2} & \dots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s^2_{p} \end{bmatrix} \end{equation}
When you do the same computation with standardized variables, you get the correlation. Remember that the correlation $r_{jk}$ always lies between -1 and 1, and that the correlation of a variable with itself equals 1.
Then, the correlation matrix is
\begin{equation} \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \dots & r_{1p} \\ r_{21} & 1 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1 \end{bmatrix} \end{equation}
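Both matrices are easy to compute with NumPy. A minimal sketch using the toy X/Y data generated earlier (regenerated here so the snippet runs on its own):

```python
import numpy as np

np.random.seed(123)
X = np.random.normal(2, 1.5, 50)
Y = np.random.normal(3, 0.6, 50)
data = np.column_stack([X, Y])  # n x p data matrix, here 50 x 2

# covariance matrix S; rowvar=False means columns are variables,
# and np.cov divides by n - 1 by default
S = np.cov(data, rowvar=False)

# correlation matrix R: the same computation on standardized variables
R = np.corrcoef(data, rowvar=False)

print(S)
print(R)
```

The diagonal of `S` holds the two sample variances, while the diagonal of `R` is all ones, matching the matrices written out above.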
4. How does PCA work? Matrices and eigendecomposition
4.1 Finding principal components
$ \mathbf{X}= (X_1, X_2, \ldots, X_p)$ is a random variable. Then the principal components of $\mathbf{X}$, denoted by $PC_1, \ldots, PC_p$, satisfy these 3 conditions:
1. Each principal component is a linear combination of the original variables: $PC_i = w_{i1}X_1 + w_{i2}X_2 + \dots + w_{ip}X_p = \mathbf{w}_i^T \mathbf{X}$, with normalized weights $\mathbf{w}_i^T \mathbf{w}_i = 1$.
2. $PC_1$ has the largest possible variance; $PC_2$ has the largest possible variance among all linear combinations uncorrelated with $PC_1$; and so on.
3. The principal components are mutually uncorrelated.
The variance of such a linear combination is $\text{var}(PC_i) = \mathbf{w}_i^T \mathbf{S} \mathbf{w}_i$. In words, this means that variances can easily be computed using the coefficients used while making the linear combinations.
We can prove that the weight vectors $\mathbf{w}_i$ are exactly the eigenvectors of the covariance matrix $\mathbf{S}$ (or of $\mathbf{R}$, when working with standardized data), and that the variance of $PC_i$ equals the corresponding eigenvalue $\lambda_i$.
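Putting all the pieces together, the steps above can be sketched in a few lines of NumPy: mean-correct the data, form the covariance matrix, eigendecompose it, and project the data onto the eigenvectors. This is a minimal sketch on the toy X/Y data from earlier (regenerated so the snippet is self-contained), not a production implementation:

```python
import numpy as np

np.random.seed(123)
X = np.random.normal(2, 1.5, 50)
Y = np.random.normal(3, 0.6, 50)
data = np.column_stack([X, Y])

# 1. mean-correct the data
centered = data - data.mean(axis=0)

# 2. covariance matrix S (divides by n - 1)
S = np.cov(centered, rowvar=False)

# 3. eigendecomposition; eigh is for symmetric matrices and returns
#    eigenvalues in ascending order, so sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. project the data onto the eigenvectors -> the principal components
scores = centered @ eigenvectors

# the variance of PC_i equals the i-th eigenvalue,
# and the components are uncorrelated
print(eigenvalues)
print(scores.var(axis=0, ddof=1))
```

To keep only the first component (as in the height/weight example), take the first column of `scores`. The same result, up to sign flips of the components, comes out of `sklearn.decomposition.PCA`.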
Sources:
- http://www.bbk.ac.uk/ems/faculty/brooms/teaching/SMDA/SMDA-02.pdf
- https://stackoverflow.com/questions/13224362/principal-component-analysis-pca-in-python