Monday, December 9, 2019

Tutorial of Principal Component Analysis | Derivation | Python Implementation

With the advancement of technologies like Machine Learning and Artificial Intelligence, it has become crucial to understand the fundamentals behind them. In this blog post, I will help you understand the concept behind dimensionality reduction, the derivation of PCA, and its Python implementation.

Principal Component Analysis

Principal Component Analysis (PCA) is one of the simpler machine learning algorithms, and it can be derived using only basic linear algebra.

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension while losing as little important information as possible.

Sometimes data can be so complicated that it is challenging to understand what it all means and which parts are actually important. In that case, PCA figures out patterns and correlations among the various features in the data set. When it finds strong correlations between different variables or features, the dimensions of the data can be reduced in such a way that the significant information is still retained.

At a high level, PCA has four main steps:

1. Standardize the data.
2. Compute the covariance matrix of the data.
3. Compute the eigenvalues and eigenvectors of this covariance matrix.
4. Use the eigenvalues and eigenvectors to select only the most important feature vectors, then transform the data onto those vectors for reduced dimensionality.

Now let's discuss each of the steps in detail:

Standardization of the data:

In data analysis and processing, we often see that skipping standardization results in a biased outcome. Standardization is all about scaling the data so that all the variables and their values lie within a similar range.

Let's understand this with an example. Say we have two variables: age, with values ranging between 10 and 50, and salary, with values ranging between 10000 and 100000. In such a scenario, the output calculated from these variables is going to be biased, since the variable with the larger range will have more impact on the outcome.
Standardization of data can be calculated by subtracting the mean of each variable from its values and dividing by its standard deviation: z = (x - mean) / standard deviation.
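As a minimal sketch of this step, the snippet below standardizes each column to zero mean and unit standard deviation. The toy age/salary values and the helper name `standardize` are assumptions made for illustration; they are not taken from the original post.

```python
import numpy as np

# Hypothetical toy data: column 0 is age (10-50), column 1 is salary (10000-100000).
X = np.array([
    [10.0, 10000.0],
    [25.0, 40000.0],
    [40.0, 75000.0],
    [50.0, 100000.0],
])

def standardize(data):
    """Scale every column to zero mean and unit standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    return (data - mean) / std

X_std = standardize(X)

print(X_std.mean(axis=0))  # both entries are (numerically) close to 0
print(X_std.std(axis=0))   # both entries are (numerically) close to 1
```

After this step, age and salary live on the same scale, so neither dominates the covariance computation simply because of its units.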

Computing the covariance matrix of data

Principal Component Analysis helps to identify the correlations and dependencies among the features in a dataset. A covariance matrix expresses how each pair of variables in the dataset varies together. It is essential to identify highly dependent variables because they contain biased and redundant information, which can reduce the overall performance of the model.
The covariance matrix is just an array where each value specifies the covariance between two feature variables, based on its x-y position in the matrix. The formula is:

$$\Sigma = \frac{1}{n-1} (X - \bar{x})^T (X - \bar{x})$$

where n is the number of data points and \bar{x} (the x with the line on top) is a vector of mean values for each feature of X. When we multiply a transposed matrix by the original one, we end up multiplying each of the features for each data point together.
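A short sketch of the covariance computation in NumPy follows. The random data and variable names are assumptions for illustration, and the 1/(n-1) normalization is the sample covariance convention, which is also the default of `np.cov`.

```python
import numpy as np

# Hypothetical data: 100 data points (rows), 3 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Vector of mean values, one per feature.
mean_vec = X.mean(axis=0)

# Transposed centered matrix times the centered matrix, divided by n - 1.
centered = X - mean_vec
cov_mat = centered.T @ centered / (X.shape[0] - 1)

# The same result with NumPy's built-in; rowvar=False means columns are features.
cov_np = np.cov(X, rowvar=False)

print(cov_mat.shape)
print(np.allclose(cov_mat, cov_np))
```

The result is a symmetric 3 x 3 matrix: the diagonal holds the variance of each feature and the off-diagonal entries hold the pairwise covariances.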

If the covariance value is negative, the respective variables are inversely related: as one increases, the other tends to decrease.
A positive covariance denotes that the respective variables are directly related: they tend to increase or decrease together.
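To make the sign of the covariance concrete, here is a small sketch with made-up values: one variable that rises with x and one that falls as x rises.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0      # increases when x increases
y_down = -3.0 * x + 10.0  # decreases when x increases

# np.cov on two 1-D arrays returns a 2x2 matrix; entry [0, 1] is cov(x, y).
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]

print(cov_up > 0)    # positive covariance: the variables move together
print(cov_down < 0)  # negative covariance: the variables move in opposite directions
```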

Computing Eigen Values and Vectors

Eigenvectors and eigenvalues must be computed from the covariance matrix in order to determine the principal components of the data set.
If an eigenvector has a corresponding eigenvalue of high magnitude, it means that our data has high variance along that vector in feature space. This vector holds a lot of information about our data, since the data is widely spread out along it.
On the other hand, eigenvectors with small eigenvalues have low variance, and our data does not vary greatly when moving along them. Since little changes along such a direction, we can say that it is not very important and we can afford to remove it.
So, that's the whole role of eigenvalues and eigenvectors within Principal Component Analysis: find the vectors that are the most important in representing our data and discard the rest. Computing the eigenvectors and eigenvalues of our covariance matrix is a one-liner in NumPy. Next, we sort the eigenvectors in descending order based on their eigenvalues.
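The decomposition and sorting can be sketched as follows. This example uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix; that choice is an assumption of this sketch (`np.linalg.eig` also works), and the data is randomly generated for illustration.

```python
import numpy as np

# Hypothetical data: 200 data points, 4 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
cov_mat = np.cov(X, rowvar=False)

# One-liner: eigenvalues and eigenvectors of the symmetric covariance matrix.
# eigh returns eigenvalues in ascending order; column i of eig_vecs
# is the eigenvector that belongs to eig_vals[i].
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Reorder so the largest eigenvalue (most variance) comes first.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

print(eig_vals)
```

Keeping the eigenvectors as columns and reordering them with the same index array ensures every eigenvalue stays paired with its own eigenvector.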

Project onto new vectors

Now we have eigenvectors ordered by "importance" to our dataset, based on their eigenvalues. We need to select the most important feature vectors and discard the rest.
Let's take an example: we have a dataset with 10 feature vectors. After computing the covariance matrix, we find its 10 eigenvalues.
The total sum of the eigenvalues is 43.1359. The first 6 values sum to 43, and 43/43.1359 = 99.68% of the total, which means our first 6 eigenvectors effectively hold 99.68% of the variance, or information, about our dataset. We can therefore discard the last 4 feature vectors, as they only contain 0.32% of the information.
We can simply define a threshold that decides whether to keep or discard each feature vector. Here, we choose a threshold of 97%.
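The threshold rule can be sketched as below. The eigenvalues in this snippet are hypothetical: they were chosen only so that they sum to 43.1359 and the six largest sum to 43, matching the totals quoted above, and the helper name `select_num_components` is likewise an assumption.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order (sum = 43.1359).
eig_vals = np.array([12, 10, 8, 7, 5, 1, 0.1, 0.03, 0.005, 0.0009])

def select_num_components(values, threshold):
    """Smallest number of leading eigenvalues whose share of the total
    variance reaches the given threshold (a fraction between 0 and 1)."""
    sorted_vals = np.sort(values)[::-1]
    cumulative = np.cumsum(sorted_vals) / sorted_vals.sum()
    # Index of the first cumulative share that is >= threshold.
    first_idx = int(np.searchsorted(cumulative, threshold))
    # Clamp in case rounding keeps the last share just below the threshold.
    return min(first_idx + 1, len(sorted_vals))

print(select_num_components(eig_vals, 0.97))
print(select_num_components(eig_vals, 0.99))
```

With these illustrative values, the five largest eigenvalues account for about 97.37% of the total, so a 97% threshold keeps 5 components, while a 99% threshold keeps 6 (about 99.68%).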

The final step is to project our data onto the vectors we chose to keep. We can do this by building a projection matrix, which is simply the matrix we multiply our data by to project it onto the new vectors. To create it, we concatenate all the eigenvectors we decided to keep. Then we simply take the dot product between our original data and our projection matrix.
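A minimal sketch of the projection step follows. The random data, the choice of k = 2 components, and the variable names are assumptions for illustration; the data is mean-centered before projecting, consistent with how the covariance matrix is computed.

```python
import numpy as np

# Hypothetical data: 100 data points, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_centered = X - X.mean(axis=0)

# Eigen decomposition of the covariance matrix, sorted by descending eigenvalue.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Projection matrix: the k eigenvectors we keep, side by side as columns.
k = 2
projection_matrix = eig_vecs[:, :k]         # shape (5, 2)

# Dot product of the data with the projection matrix.
X_reduced = X_centered @ projection_matrix  # shape (100, 2)

print(X_reduced.shape)
```

Each column of `X_reduced` is the data expressed along one kept eigenvector, and its sample variance equals the corresponding eigenvalue.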

The dimensionality is now reduced!

If you have any doubts or issues, put your comments in the comment box.