

Principal Component Analysis (PCA)

Principal Component Analysis belongs to a class of linear transforms based on statistical techniques. The method provides a powerful tool for data analysis and pattern recognition, and it is often used in signal and image processing for data compression, dimensionality reduction, or decorrelation. PCA is an unsupervised learning method, similar to clustering: it finds patterns without prior knowledge about whether the samples come from different treatment groups or have essential differences. The objective is pursued by analysing principal components, through which we can perceive relationships that would otherwise remain hidden in higher dimensions. The representation produced must be such that the loss of information after discarding the higher dimensions is minimal. In short, the goal of the method is to reorient the data so that a multitude of original variables can be summarized with relatively few "factors" or "components" that capture the maximum possible information from the original variables.

The output of PCA is a set of principal components, whose number is less than or equal to the number of original variables; it is less when we wish to discard dimensions and reduce the dataset.

The transformation of an image from colour to grey level, i.e. to image intensity, can be done with most of the common algorithms. The implementation is usually based on a weighted sum of the three core colour components Red, Green, and Blue: the R, G, and B matrices contain the image colour components, and the weights are chosen according to the properties of human perception. The PCA method provides an alternative, in which the matrix A is replaced by a matrix A_l formed from only the l largest (instead of n) eigenvalues; a vector of reconstructed variables is then obtained from the corresponding relation. A selected real picture P and its three reconstructed components are obtained accordingly for each eigenvalue and presented. The intensity image obtained from the original image as the weighted colour sum can then be compared with the first principal component. The variance figures for each principal component are given in the eigenvalue list; they indicate the amount of variation accounted for by each component within the feature space.
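As a concrete illustration of the weighted-sum conversion described above, here is a minimal sketch; the weights are the common BT.601 luminance values (0.299, 0.587, 0.114), given here only as an example, and the guide's own code later uses a plain channel sum instead.

import numpy as np
from matplotlib.image import imread

# Minimal sketch: grey level as a weighted sum of the R, G, and B components
rgb = imread("/path/.png")                  # shape: (height, width, channels)
weights = np.array([0.299, 0.587, 0.114])   # assumed perceptual weights (BT.601), for illustration only
grey = rgb[..., :3] @ weights               # shape: (height, width)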

Getting Started with the Code

Importing libraries

import numpy as np
from matplotlib.image import imread
import matplotlib.pyplot as plt

Here we will be using imread from matplotlib to import the image as a matrix

Setting image path

my_image = imread("/path/.png")
print(my_image.shape)

Displaying the image

plt.figure(figsize=[12,8])
plt.imshow(my_image)

The image being processed is a coloured image and hence has data in 3 channels: Red, Green, and Blue. Therefore the shape of the data has three dimensions, (height, width, channels).

Processing the image

Let us now start with our image processing. First, we will be greyscaling the image, and then we will perform PCA on the matrix with all the components. We will also create and look at the scree plot (in multivariate statistics, a scree plot is a line plot of the eigenvalues of the factors or principal components in an analysis) to assess how many components we could retain and how much cumulative variance they capture.

Greyscaling the image

image_sum = my_image.sum(axis=2)        # collapse the colour channels into a single intensity matrix
print(image_sum.shape)

new_image = image_sum/image_sum.max()   # rescale intensities to the range 0-1
print(new_image.max())

plt.figure(figsize=[12,8])
plt.imshow(new_image, cmap=plt.cm.gray)

Creating scree plot

from sklearn.decomposition import PCA, IncrementalPCA
pca = PCA()            # keep all components for now so we can inspect the full variance profile
pca.fit(new_image)

Getting the cumulative variance

var_cumu = np.cumsum(pca.explained_variance_ratio_)*100

How many PCs explain 95% of the variance?

k = np.argmax(var_cumu > 95) + 1   # argmax gives the index of the first value above 95%, so add 1 to get the component count
print("Number of components explaining 95% variance: "+str(k))
print("\n")

plt.figure(figsize=[10,5])
plt.title('Cumulative explained variance by number of components')
plt.xlabel('Principal components')
plt.ylabel('Cumulative explained variance')
plt.axvline(x=k, color="k", linestyle="--")
plt.axhline(y=95, color="r", linestyle="--")
ax = plt.plot(range(1, len(var_cumu) + 1), var_cumu)   # plot against the component count (1-based)

Now let's reconstruct the image using only the first k components (the number found above) and see if the reconstructed image comes out to be visually different from the original image

Reconstructing using Inverse Transform

ipca = IncrementalPCA(n_components=k)                                 # keep only the first k components
image_recon = ipca.inverse_transform(ipca.fit_transform(new_image))   # project to k components, then map back to pixel space

Plotting the reconstructed image

plt.figure(figsize=[12,8])
plt.imshow(image_recon, cmap=plt.cm.gray)

As we can observe, there is a relative difference now. We shall try a different number of components to check whether that makes a difference to the missing clarity and helps capture finer details in the visuals.

Function to reconstruct and plot image for a given number of components

def plot_at_k(k):
	# Reconstruct the greyscale image from its first k principal components and plot it
	ipca = IncrementalPCA(n_components=k)
	image_recon = ipca.inverse_transform(ipca.fit_transform(new_image))
	plt.imshow(image_recon, cmap=plt.cm.gray)

k = 150
plt.figure(figsize=[12,8])
plot_at_k(k)

We can observe that, yes, the number of principal components does make a difference!

Plotting the same for different numbers of components to compare the exact relative difference,

Setting different values of k

ks = [10, 25, 50, 100, 150, 250]

plt.figure(figsize=[15,9])

for i in range(6):
	plt.subplot(2,3,i+1)
	plot_at_k(ks[i])
	plt.title("Components: "+str(ks[i]))

plt.subplots_adjust(wspace=0.2, hspace=0.0)
plt.show()

Using PCA for image reconstruction, we can also separate an image into the amounts of Red, Green, and Blue present in it,

import cv2
img = cv2.cvtColor(cv2.imread('/path/.image'), cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.show()

Splitting into channels

red, green, blue = cv2.split(img)   # img is in RGB order after cvtColor, so the split yields R, G, B

Plotting the images

fig = plt.figure(figsize=(15,7.2))
fig.add_subplot(131)
plt.title("Blue Presence")
plt.imshow(blue)
fig.add_subplot(132)
plt.title("Green Presence")
plt.imshow(green)
fig.add_subplot(133)
plt.title("Red Presence")
plt.imshow(red)
plt.show()

A particular image channel can also be converted into a data frame for further processing,

import numpy as np
import pandas as pd

Creating a dataframe from the blue presence in the image

blue_chnl_df = pd.DataFrame(data=blue)
blue_chnl_df

The data for each colour presence can also be fitted and transformed with a chosen number of components to check how much variance each colour presence retains,

Scaling data between 0 and 1

df_blue = blue/255
df_green = green/255
df_red = red/255

Setting a reduced number of components

pca_b = PCA(n_components=50)
pca_b.fit(df_blue)
trans_pca_b = pca_b.transform(df_blue)

pca_g = PCA(n_components=50)
pca_g.fit(df_green)
trans_pca_g = pca_g.transform(df_green)

pca_r = PCA(n_components=50)
pca_r.fit(df_red)
trans_pca_r = pca_r.transform(df_red)

Checking the transformed shapes

print(trans_pca_b.shape)
print(trans_pca_r.shape)
print(trans_pca_g.shape)

Checking the retained variance after reducing the components

print(f"Blue Channel : {sum(pca_b.explained_variance_ratio_)}")
print(f"Green Channel: {sum(pca_g.explained_variance_ratio_)}")
print(f"Red Channel  : {sum(pca_r.explained_variance_ratio_)}")

Output (example):

Blue Channel : 0.9835704508744926
Green Channel: 0.9794100254497594
Red Channel  : 0.9763416610407115

We can observe that by using 50 components we can keep around 98% of the variance in the data!
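To close the loop, the three reduced channels can be mapped back to pixel space and stacked into a colour image again. The following is a minimal sketch continuing from the variables defined above; it simply applies inverse_transform per channel, clips to the valid range, and stacks the channels in RGB order.

# Minimal sketch: reconstruct the colour image from the three per-channel PCA models above
b_recon = pca_b.inverse_transform(trans_pca_b)
g_recon = pca_g.inverse_transform(trans_pca_g)
r_recon = pca_r.inverse_transform(trans_pca_r)

img_recon = np.dstack((r_recon, g_recon, b_recon))   # stack back into (height, width, 3), RGB order
img_recon = np.clip(img_recon, 0, 1)                 # keep values in the valid 0-1 range for imshow

plt.figure(figsize=[12,8])
plt.imshow(img_recon)
plt.show()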


PCA, an instance of eigen-analysis

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
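To make this concrete, one standard way to write the transformation is sketched below; the symbols $\bar{\mathbf{x}}$, $\Sigma$, $W$, and $\mathbf{z}_n$ are introduced here only for illustration.

$$
\Sigma = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n-\bar{\mathbf{x}})(\mathbf{x}_n-\bar{\mathbf{x}})^{\mathsf{T}},
\qquad
\Sigma\,\mathbf{w}_i = \lambda_i\,\mathbf{w}_i,
\quad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p,
\qquad
\mathbf{z}_n = W^{\mathsf{T}}(\mathbf{x}_n-\bar{\mathbf{x}}).
$$

Here the columns of $W$ are the unit eigenvectors $\mathbf{w}_i$ of the sample covariance matrix $\Sigma$; the $i$-th entry of $\mathbf{z}_n$ is the coordinate of observation $n$ along the $i$-th principal component, and the variance along that axis is $\lambda_i$.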

The objective of PCA is to rigidly rotate the coordinate axes of the $p$-dimensional linear space to new "natural" positions (principal axes) such that:

  • The coordinate axes are ordered such that principal axis 1 corresponds to the highest variance in the data, axis 2 has the next highest variance, ..., and axis $p$ has the lowest variance.
  • The covariance between each pair of principal axes is zero, i.e., they are uncorrelated.

Geometric motivation, principal components

  • Two-dimensional vector space of observations, $(x_1, x_2)$.
  • Each observation corresponds to a single point in the vector space.
  • The goal: find another basis of the vector space which treats variations of the data better.
  • We will see later: data points (observations) are represented in a rotated orthogonal coordinate system. The origin is the mean of the data points and the axes are provided by the eigenvectors (see the sketch after this list).
  • Assume a single straight line approximating the observations best in the (total) least-squares sense, i.e. minimizing the sum of perpendicular distances between the data points and the line.
  • The first principal direction (component) is the direction of this line. Let it be a new basis vector $z_1$.
  • The second principal direction (component, basis vector) $z_2$ is a direction perpendicular to $z_1$ that minimizes the distances of the data points to a corresponding straight line.
  • For higher-dimensional observation spaces, this construction is repeated.
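A minimal numerical sketch of this geometric picture follows (the variable names are illustrative and not taken from the guide): it generates correlated two-dimensional points, centres them on their mean, and recovers the principal axes as the eigenvectors of the covariance matrix; the eigenvector with the largest eigenvalue is the direction of the best-fitting line in the total least-squares sense.

import numpy as np

rng = np.random.default_rng(0)
points = rng.multivariate_normal(mean=[2.0, -1.0],
                                 cov=[[3.0, 1.2], [1.2, 1.0]],
                                 size=500)              # (x1, x2) observations, one per row

centered = points - points.mean(axis=0)                 # the mean becomes the new origin
cov = np.cov(centered, rowvar=False)                    # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                  # eigenvalues returned in ascending order

order = np.argsort(eigvals)[::-1]                       # reorder so axis 1 carries the highest variance
z1, z2 = eigvecs[:, order].T                            # z1: first principal direction, z2: perpendicular to it

scores = centered @ eigvecs[:, order]                   # data expressed in the rotated coordinate system
print(eigvals[order])                                   # variance along each principal axis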

Principal component analysis, introduction

  • PCA is a powerful and widely used linear technique in statistics, signal processing, image processing, and elsewhere.
  • In statistics, PCA is a method for simplifying a multidimensional dataset to lower dimensions for analysis, visualization, or data compression.
  • PCA represents the data in a new coordinate system in which the basis vectors follow the modes of greatest variance in the data.
  • Thus, new basis vectors are calculated for the particular data set.
  • The price to be paid for PCA's flexibility is its higher computational requirements compared to, e.g., the fast Fourier transform.

Derivation, $M$-dimensional case

  • Suppose a data set comprising $N$ observations, each of $M$ variables (dimensions). Usually $N\gg M$.
  • The aim: to reduce the dimensionality of the data so that each observation can be usefully represented with only $L$ variables, $1\leq L\leq M$.
  • Data are arranged as a set of $N$ column data vectors, each representing a single observation of $M$ variables: the $n$-th observation is a column vector $\mathbf{x}_n$ (a sketch of the corresponding computation follows this list).
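A minimal sketch of this setup under the assumptions stated above (the array shapes and names below are illustrative): the data are arranged as an $M \times N$ matrix, the covariance of the $M$ variables is eigendecomposed, and each observation is then represented by its scores on the $L$ leading eigenvectors.

import numpy as np

M, N, L = 8, 1000, 3                                   # M variables, N observations, keep L components
rng = np.random.default_rng(1)
X = rng.normal(size=(M, N))                            # columns are the observations x_n

mean = X.mean(axis=1, keepdims=True)                   # per-variable mean
B = X - mean                                           # mean-centred data
C = (B @ B.T) / (N - 1)                                # M x M covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)                   # eigenvalues in ascending order
W = eigvecs[:, np.argsort(eigvals)[::-1][:L]]          # M x L matrix of the L leading eigenvectors

Z = W.T @ B                                            # L x N matrix of reduced representations
X_approx = W @ Z + mean                                # reconstruction of X from only L components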
