Performing Principal Component Analysis (PCA)

Introduction

In this lesson, you'll code PCA from the ground up using NumPy. This should give you a deeper understanding of the algorithm and further practice with your linear algebra skills.

Objectives

You will be able to:

  • Understand the steps required to perform PCA on a given dataset
  • Understand and explain the role of eigendecomposition in PCA

Step 1: Get some data

To start, generate some data for PCA!

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sample 100 points uniformly from [0, 10] for the first variable,
# then make the second variable a noisy linear function of the first
x1 = np.random.uniform(low=0, high=10, size=100)
x2 = [(xi * 3) + np.random.normal(scale=2) for xi in x1]
plt.scatter(x1, x2);

(Figure: scatter plot of x1 against x2, showing a strong positive linear trend)

Step 2: Subtract the mean

Next, you have to subtract the mean from each dimension of the data. So, all the $x_1$ values have $\bar{x}_1$ (the mean of the $x_1$ values of all the data points) subtracted from them, and all the $x_2$ values have $\bar{x}_2$ subtracted from them.

import pandas as pd

data = pd.DataFrame([x1,x2]).transpose()
data.columns = ['x1', 'x2']
data.head()
         x1         x2
0  1.032030  -0.492450
1  3.979683   9.763812
2  8.883582  28.979365
3  9.893061  28.469174
4  8.145913  25.109462
data.mean()
x1     4.535001
x2    13.566201
dtype: float64
mean_centered = data - data.mean()
mean_centered.head()
         x1         x2
0 -3.502971 -14.058652
1 -0.555318  -3.802390
2  4.348581  15.413164
3  5.358060  14.902973
4  3.610912  11.543261
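
As a quick sanity check, the centered data should now have a mean of (approximately) zero in each column:

mean_centered.mean()  # both values should be ~0, up to floating-point error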

Step 3: Calculate the covariance matrix

Now that you have mean-centered your data, the next step is to calculate the covariance matrix.

cov = np.cov([mean_centered.x1, mean_centered.x2])
cov
array([[ 7.15958302, 21.47416477],
       [21.47416477, 67.92443943]])
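
Under the hood, np.cov uses the unbiased estimator $C = \frac{1}{n-1} X^\top X$, where $X$ is the mean-centered data with one observation per row. As a minimal sanity check, you can compute the same matrix by hand (using only variables already defined above):

n = mean_centered.shape[0]
cov_manual = mean_centered.T.dot(mean_centered).values / (n - 1)  # same formula, computed by hand
np.allclose(cov_manual, cov)  # True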

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

Now that you've calculated the covariance matrix, it's time to compute the associated eigenvectors and eigenvalues. The eigenvectors will form the new axes when it's time to reproject the dataset onto the new basis.

eigen_value, eigen_vector = np.linalg.eig(cov)
eigen_vector
array([[-0.95305204, -0.30280656],
       [ 0.30280656, -0.95305204]])
eigen_value
array([ 0.33674687, 74.74727559])
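
Recall that an eigenpair of $C$ satisfies $C v = \lambda v$. You can verify the result directly; note that np.linalg.eig returns the eigenvectors as the columns of the output array:

for i in range(len(eigen_value)):
    v = eigen_vector[:, i]  # the i-th eigenvector (stored as a column)
    assert np.allclose(cov.dot(v), eigen_value[i] * v)  # C v = lambda v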

Step 5: Choosing components and forming a feature vector

If you look at the eigenvectors and eigenvalues above, you can see that the two eigenvalues are very different in size: the larger is more than 200 times the smaller. In fact, it turns out that the eigenvector with the highest eigenvalue is the first principal component of the data set.

In general, once the eigenvectors have been found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Typically, PCA is used to reduce the dimensionality of the dataset, so some of these eigenvectors will subsequently be discarded. In general, the smaller an eigenvalue is relative to the others, the less information the corresponding component carries.
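
To make "significance" concrete, each eigenvalue can be expressed as a fraction of the total variance. A short sketch using the arrays above (the name explained_variance_ratio is illustrative, borrowed from common PCA terminology):

explained_variance_ratio = eigen_value / eigen_value.sum()
explained_variance_ratio  # roughly array([0.0045, 0.9955]) for the eigenvalues shown above

Here the first principal component alone captures over 99% of the variance, so discarding the second loses very little information.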

Finally, you need to form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns as shown below:

e_indices = np.argsort(eigen_value)[::-1]  # indices that sort the eigenvalues, highest to lowest
eigenvectors_sorted = eigen_vector[:, e_indices]  # reorder the columns (eigenvectors) to match
eigenvectors_sorted
array([[-0.30280656, -0.95305204],
       [-0.95305204,  0.30280656]])

Step 6: Deriving the new data set

This is the final step in PCA, and it is also the easiest. Once you have chosen the components (eigenvectors) that you wish to keep and formed a feature vector, you simply take the transpose of that feature vector and multiply it on the left of the mean-centered data set, also transposed.
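
In matrix form, if $W$ is the feature vector (eigenvectors as columns) and $X$ is the mean-centered data with one observation per row, the transformed data is $Y = (W^\top X^\top)^\top$, which is the same as $Y = X W$.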

# The eigenvectors sit in the columns of eigenvectors_sorted, so transpose it
# before multiplying on the left of the transposed, mean-centered data
transformed = eigenvectors_sorted.T.dot(mean_centered.T).T
transformed[:5]
array([[ 14.4593493 ,  -0.91853819],
       [  3.79202935,  -0.62214167],
       [-16.00632588,   0.52278351],
       [-15.82576429,  -0.59379253],
       [-12.0947357 ,   0.05398833]])
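
Because the eigenvectors are orthonormal, this transformation is invertible: multiplying by the feature vector and adding back the mean recovers the original data. A short sketch (reconstructed, reduced, and approx are illustrative names, not part of the lesson):

# Full round trip: rotate back into the original basis and re-add the mean
reconstructed = transformed.dot(eigenvectors_sorted.T) + data.mean().values
np.allclose(reconstructed, data)  # True

# Dimensionality reduction: keep only the first principal component
reduced = transformed[:, :1]  # shape (100, 1)
approx = reduced.dot(eigenvectors_sorted[:, :1].T) + data.mean().values  # rank-1 approximation of the data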

Summary

That's it! You just coded PCA on your own using NumPy! In the next lab, you'll continue to practice this on your own!
