Decoding Dimensionality Reduction, PCA and SVD

Machine Learning | Published May 20, 2015

According to IBM, we create 2.5 quintillion bytes of data every day, and most of the data generated is high-dimensional. So it is often necessary to reduce the dimensionality of the data to work with it efficiently.
One of the most common dimensionality reduction techniques is filtering, in which you discard most of the dimensions and concentrate only on a few of them. But that doesn't always work. When you are dealing with image data, for example, the number of pixels is the number of dimensions, so you have a lot of dimensions and you don't want to throw any of them away in order to make sense of your overall data set.
As the dimensionality of your data increases, the volume of the space increases, and in a sense the data you have becomes more and more sparse (scattered). One way to think about it is that a high-dimensional data set might live on some kind of manifold, and as you increase the number of dimensions, that manifold becomes bigger and bigger.
To do any statistical modelling, you have to increase the number of data points and samples you have; unfortunately, the amount of data you need grows exponentially with the number of dimensions. The higher the dimensionality, the more data you need to perform statistical inference. The basic idea is that n (the number of data items) should be larger than the number of dimensions.
The image shown in the figure below has 64×64 pixels (4096 dimensions). To reduce the dimensions, you want to project the high-dimensional data onto a lower-dimensional subspace using linear or non-linear transformations (projections).
[Figure: dimensionality reduction in machine learning]
The most widely used methods are linear projections, and the main linear projection method is principal component analysis (PCA).
Principal component analysis (PCA)

You have data points that live in a 2D plane (x1 and x2) and you want to approximate them with a lower-dimensional embedding. Here it is obvious that the vector V (as shown in the image) is a pretty good approximation of your data. Instead of storing 2 coordinates for each data point, you store one scalar value plus the vector V, which is common across all of the data points, so you have to store it only once. For each data point you store only the scalar value s, which gives the distance along the vector V.
[Figure: 2D data points and the approximating vector V]
In the figure below, you are projecting all of your data points onto V. What you are trying to do is minimize, in the least-squares sense, the difference between your original data and your projections. You should choose V so that the residual variance is minimized.
*The residuals are the distances between the data points and their projections onto the vector V.
[Figure: minimizing the residuals of the projections onto V]
In this case the projection is orthogonal to the vector V. You have to minimize the overall sum of the residuals of your data points by choosing the vector in such a way that this sum is minimized. It turns out that this is also the vector that allows you to reconstruct the original data with the least squared error. It makes intuitive sense: you are picking V in the direction of the biggest spread of your data. You can extend this to multiple components. The main component, which we call the first principal component, is the vector V that you are using to project the data on. You can then repeat the process and find the second component, which captures the second biggest share of the variance; in this case it is the direction of principal component 2 (see the image below), orthogonal to the first.
[Figure: the principal component vectors are orthogonal to each other]
It is very important to understand the difference between Principal Component Analysis (PCA) and Ordinary Least Squares (OLS). I encourage you to go through this discussion on Stack Exchange.
To summarize, you are projecting your data onto these subspaces (onto the principal components) to maximize the variance of the projected data; that's the basic idea.
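To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the original post) that finds the direction V of largest variance in some toy 2D data, stores a single scalar s per point, and checks the residuals:

```python
import numpy as np

# Toy 2D data whose spread lies mostly along one direction
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.3 * x + 0.1 * rng.normal(size=200)])

X = X - X.mean(axis=0)                # center the data
S = np.cov(X, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues come back in ascending order
v = eigvecs[:, -1]                    # the vector V: direction of largest variance

s = X @ v                             # one scalar per point: distance along V
X_approx = np.outer(s, v)             # reconstruction from s and V alone

print("variance of the projections:", s.var())
print("mean squared residual:", ((X - X_approx) ** 2).sum(axis=1).mean())
```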
Understanding Singular Value Decomposition (SVD)
The PCA algorithm is implemented as follows (a short NumPy sketch of these steps is given after the note below):
1) Subtract the mean from the data.
2) Scale each dimension by its standard deviation, so that every dimension has unit variance.
3) Compute the covariance matrix S = (1/N) XᵀX, where X is the mean-centered data matrix with one row per data point.
4) Compute the K eigenvectors of S with the largest eigenvalues. These eigenvectors are the principal components of the data set.
Note: Eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue. The eigenvector gives the direction of the line (vertical, horizontal, 45 degrees, etc.), and the eigenvalue is a number telling you how much variance there is in the data in that direction, i.e. how spread out the data is along that line.
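A short NumPy sketch of these four steps, under the assumption that every dimension has nonzero variance; the function name pca and the exact normalization are my own choices, not code from the post:

```python
import numpy as np

def pca(X, k):
    """Return the k largest eigenvalues of the covariance matrix and
    the corresponding eigenvectors (the principal components).
    X is an (n_samples, n_features) data matrix."""
    X = X - X.mean(axis=0)                    # 1) subtract the mean
    X = X / X.std(axis=0)                     # 2) scale each dimension (assumes nonzero spread)
    S = (X.T @ X) / X.shape[0]                # 3) covariance matrix S = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(S)      # 4) eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:k]     #    indices of the k largest eigenvalues
    return eigvals[order], eigvecs[:, order]  # eigenvectors are returned as columns
```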
But this operation (the steps above) is expensive when X is very large or very small. So the best way to compute the principal components is to use the SVD. Singular value decomposition (SVD) is one of the best linear transformation methods out there.
You can take any matrix X (it doesn't matter whether it is square, singular or diagonal) and decompose it into a product of three matrices, as shown in the figure below: two orthogonal matrices U and V and a diagonal matrix D. U has the same dimensions as your data matrix, the diagonal matrix D is square with dimensions k×k (k is the number of variables you have), and V is again a k×k square matrix.
[Figure: X = U D Vᵀ, where U is n×k, D is a k×k diagonal matrix, and V is k×k]
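As a quick sanity check (my own example, not from the post), numpy.linalg.svd returns exactly these three factors, and multiplying them back together recovers X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 data points, k = 5 variables

# full_matrices=False gives the "thin" SVD with the shapes described above
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, d.shape, Vt.shape)      # (100, 5) (5,) (5, 5)

D = np.diag(d)                         # NumPy returns the diagonal of D as a 1D array
print(np.allclose(X, U @ D @ Vt))      # True: X = U D V^T
```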
I am going to compute the SVD of S (the covariance matrix) to obtain its eigenvectors.
S = U D Vᵀ
Some interesting notes on the result of the factorization:
– The columns of the matrix U are the eigenvectors of S.
– D is a diagonal matrix. The values on the diagonal are the eigenvalues of S, in descending order.
This is equivalent to principal component analysis, but computed in a much more robust way: you just take your matrix and feed it into an SVD routine.
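To see that equivalence numerically, the sketch below (again my own illustration) runs an SVD on a covariance matrix and compares the factors with a direct eigen-decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
S = (X.T @ X) / X.shape[0]              # covariance matrix of the centered data

U, d, Vt = np.linalg.svd(S)             # SVD of the symmetric matrix S
eigvals, eigvecs = np.linalg.eigh(S)    # eigen-decomposition for comparison (ascending)

# the diagonal of D holds the eigenvalues of S, already in descending order
print(np.allclose(d, eigvals[::-1]))                       # True
# the columns of U are the eigenvectors of S (up to sign)
print(np.allclose(np.abs(U), np.abs(eigvecs[:, ::-1])))    # True
```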
What does SVD have to do with dimensionality reduction?
The image below shows how to reduce the number of dimensions from k to q (q < k). If you keep only the first q column vectors of U (and the matching entries of D and rows of Vᵀ), you obtain a q-dimensional hyperplane approximation of the data. The values of D tell you how much variance is retained by this reduction.
[Figure: truncating the SVD from k to q dimensions]
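Concretely, keeping only the first q columns of U (and the matching entries of D and rows of Vᵀ) might look like this; the data is made up and q = 3 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # k = 10 original dimensions
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

q = 3                                   # reduce from k = 10 to q = 3 dimensions
X_reduced = U[:, :q] * d[:q]            # q coordinates per data point
X_approx = X_reduced @ Vt[:q]           # best rank-q approximation of X

print(X_reduced.shape)                                        # (200, 3)
print("variance retained:", (d[:q] ** 2).sum() / (d ** 2).sum())
```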
The best way to figure out how much variance your dimensions capture is to draw a scree plot.
[Figure: scree plot of the variance explained by each principal component]
It turns out that the variance of each principal component is proportional to the square of the corresponding diagonal element of D. In the scree plot, the first principal component explains 37% of the variance, and then the contribution quickly drops. A good rule of thumb is to keep enough principal components to explain 80-90% of the variance of your data set.
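In code, the variance explained by each component comes straight from the squared diagonal of D; here is a small sketch with random data (the 90% threshold is just a common rule of thumb):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

var_explained = d ** 2 / np.sum(d ** 2)   # proportion of variance per component
cumulative = np.cumsum(var_explained)

# smallest q that retains at least 90% of the variance
q = int(np.searchsorted(cumulative, 0.90)) + 1
print(var_explained.round(3))             # this is what a scree plot would show
print("components needed for 90% of the variance:", q)
```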
Image recognition example
Let's say your task is to recognize people's faces. You take photographs of people, make small pictures out of them, and then center all the faces so that they are roughly aligned.
[Figure: face images for the image recognition example]
Let's say each image is 64×64 pixels, so you have 4096 dimensions. You roll out each of the images into a vector and then stack them up into your data matrix. Each pixel is a dimension (a column) and each row is a different person.
[Figure: the high-dimensional data matrix, one image per row]
So you have this data matrix (as shown in the above image) and you apply PCA to it. Here is what you get (as shown in the image below); these are called eigenfaces. The image on the left is the mean (the average face). The first image on the right-hand side seems to explain the variance in the left-right direction, and the second image seems to explain the variance in the front-to-back direction.
[Figure: the mean face and the leading principal components (eigenfaces)]
In the image below you can look at the variance of the components; it turns out that you need about 50 eigenvectors to explain 90% of the variance of an image.
[Figure: eigenvalues for the face data set]
As you can see, after 50 eigenvectors this is a pretty good reconstruction. So you went from 4096 dimensions to 50, which is a nice reduction in dimensionality without too much reduction in quality. I hope I have given you a broad idea of what dimensionality reduction, PCA and SVD are, without getting into too much mathematical detail.
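To tie everything together, here is a minimal end-to-end sketch of the eigenfaces pipeline described above. It uses random numbers as a stand-in for real aligned 64×64 face images, so the array faces and the choice of 100 images are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real data: in practice `faces` would hold aligned 64x64 grayscale images.
faces = rng.normal(size=(100, 64, 64))

n, h, w = faces.shape
X = faces.reshape(n, h * w)              # roll each image out into a 4096-dim row vector

mean_face = X.mean(axis=0)               # the "average face"
Xc = X - mean_face

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

q = 50                                   # keep the first 50 components (the eigenfaces)
eigenfaces = Vt[:q]                      # each row is a 4096-dim eigenface
weights = Xc @ eigenfaces.T              # 50 coefficients per face instead of 4096 pixels

# approximate reconstruction of every face from its 50 coefficients
X_hat = mean_face + weights @ eigenfaces
faces_hat = X_hat.reshape(n, h, w)
print(faces_hat.shape)                   # (100, 64, 64)
```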