I was revising my understanding of Principal Components Analysis and I thought it might be a good idea to articulate what I understood in the form of this blog post. I intend this to serve as quick a tutorial for any future reader who has some idea about PCA and wants to brush up the concepts.
First off, let me start with pointing out an excellent resource to understand almost everything you would want to about PCA. Check this PCA tutorial out: https://arxiv.org/pdf/1404.1100.pdf. It covers both the math and the intuition behind PCA (I think I will write only about the inuition).
What’s the goal with PCA?
The idea with PCA is that given a system defined by some features, we want to find an alternate set of features such that the reconstruction error when we move from one set of features to another, is minimum. Another way to look at this is that we are given a high dimensional dataset and we want to bring this down to a lower dimension, such that the we still represent most of the information. We are in some sense trying to separate the signal (relevant features) from the noise. A general assumption is that the new set of features we find is a linear combination of the old set of features.
How is PCA calculated afterall?
Just follow these steps:
- Represent your dataset as an mxn matrix, where m represents the number of data points and n represents the number of columns
- Subtract the mean of your data-set from all poins, such that the mean of the overall data is zero. Let’s call this new dataset X
- Find the Covariance matrix of this mean subtracted matrix as XTX
- Calculate the SVD of the covariance matrix. The columns of the V matrix are the principal components. (They are also the eigen vectors)
The above steps seem easy. Why do they work?
As we did earlier, imagine we are given a matrix X where each row corresponds to a datapoint and each column corresponds to one of the features. We want to find a new feature space where the correlation between individual features is zero and also there is an ordering in the features such that the first few features can reconstruct the entire data very accurately, that is to say, the variance within each feature should be high. This entire concept is neatly packed into a covariance matrix. We want to transform X into Y ( by a linear transformation ) such that, the covariance matrix of Y is a diagonal matrix. As the off-diagonal elements in a covariance matrix represent the correlation between pairs of features and the diagonal elements represents the variance within each feature (or the correlation between a feature with itself), a diagonal matrix gives us the solution we are looking for. Or more importantly, whatever it took to create this diagonal matrix is the solution we are looking for. (Refer to section V and VI in the linked paper).
We know that we can use an SVD decomposition to diagonalise a matrix. So, given the covariance matrix, we write it in the form of USV, where S is the diagonalised matrix and V (or UT) can give us the principal components. (In the interest of time, I was quick with a few parts here. Please refer to the paper!)