Singular Value Decomposition

Data ScienceSVDMathematicsDatasets

Intution and logic behind SVD

Singular Value Decomposition (SVD)

Before talking about SVD, let us discuss the root cause why we wanted one like this.

In linear algebra, people have some high beauty standards. The love for square matrices. These were some demigods to them. They feel that they have so many good properties. But due to the extreme love towards these kind of matrices, the rectangular matrices were left behind.

Let us discuss one of the main thing that people wanted to deal even with rectangular matrices. The diagonalization of a matrix. This is done in square matrices and it is a decomposition of a square matrix into 3 different matrices

A=PDP1A = PDP^{-1}

This kind of decomposition helps people in a lot of ways and is considered good. One of the example is An=PDnP1A^{n} = PD^{n}P^{-1} and many other things which is not in the scope of this text.

So, thus came the idea of SVD, as the diagonalization is the thing only for square matrices. In rectangular matrices, we can't do this kind of decomposition. Hence we sought of a different technique.

Instead of decomposing the matrix AA we decompose the matrix ATAA^{T}A and find things out for this. Even finding a determinant for a rectangular matrix uses this technique and is called a pseudo inverse. So, let us deal with the new thing.

Before going into the SVD, I shall write some of the questions I have got while reading the text on SVD in the book Foundations of Data Science by Blum.

  1. If there is a matrix of d rows but the rank is k given k<dk < d. Then what is the best subspace VV for which the matrix can be shrinked to?

To know the answer, we shall first go through the situations where this comes helpful in real life. Let us taake the example of Machine learning. In this field, we typically deal with huge datasets which are often of low rank than the full rank of the matrix. In these situations, we prefer to fit the matrix into a subspace of lower dimensions which would eventually help us in doing the math over it. Great, it seems well enough. Then how?

The best fit subspace

Given that we have a matrix AA, and we want to find the best fit subspace, we denote it with the following (calm down, I shall dissect each of them).

A=UDV1A = UDV^{-1}

let us say that the rows in UU are uiu_i and the rows of VV are viv_i. The entries in the matrix DD are on the principal diagonal and is a diagonal matrix. We define each of the rows in each matrix soon.

Let us say we have got some "kind of" eigen values and eigen vectors for the matrix ATAA^{T}A. So, we follow the following procedure.

ATAv=σi2vσi=λi(ATA)=λi2  (Since it is a kind of square of the matrix, we will be getting the eigen value squared)=λi\begin{align} A^{T}A v &= \sigma_{i}^2 v \\ \sigma_{i} &= \sqrt{\lambda_{i}(A^{T}A)} \\ &= \sqrt{|\lambda_{i}|^2} \; \text{(Since it is a kind of square of the matrix, we will be getting the eigen value squared)}\\ &= |\lambda_{i}| \end{align}

We call this σi\sigma_{i} as a singular value of the matrix AA and the vv to be singular vector. The main difference between a singular value and an eigen value is that a singular value is always positive and an eigen value can be positive as well as negative. Now we shall come back to the point of finding the best fit subspace. We say that, the vectors viv_{i} are the orthonormal vectors that shall span the subspace VV and we need to find them out. Initially, we shall go by the basic method.

The Greedy Algorithm\underline{\text{The Greedy Algorithm}}

Let us say that we have nn points in dd dimensional space and the points are in rows of the matrix AA. The matrix is usually of low rank when we speak about these datasets. So, we plot them on the dd-dimensional space and figure out that one single vector vv which has the least perpendicular distance or the maximum projection of those points on to it. We shall call it the top singular vector. How shall we figure it out? It is that vector vv for which we have

v1=argmaxv=1Avv_{1} = arg \max_{\|v\| = 1} \|A v\|

We call this v1v_{1} as a top singular vector. and the length of this vector is the top singular value of AA.

σ1(A)=Av1\sigma_{1}(A) = |Av_{1}|

That would have been great if all the points lie over the line, but it generally is not the case, i.e., there can be a case, a little amount of points are not aligned on the line and we would now need a 2dimensional2-dimensional subspace. The question is how to find such subspace? It is finding a vector perpendicular to v1v_{1} and the subspace contains v1v_{1} along with all the points in the subspace. Now, we have all the points on a single plane. This makes the perpendicular distance from the positions of the points to the subspace to be zero. How do we know when to stop entertaining this kind of iteration? When we continue the process of finding a new vector and then checking the projection, if we find out that the projection on to the subspace is zero, then we shall find that we need to stop and no more new vectors are entertained.

Frobenius Norm\underline{\text{Frobenius Norm}}

Before defining Frobenius Norm, let us do some calculations. Let us say each row of matrix AA is aia_{i} and the vectors that span the whole rows of AA are v1,v2,,vrv_{1}, v_{2}, \dots, v_{r} then

j=1r(ajvi)2=aj2 \sum_{j = 1}^{r}(a_{j}\cdot v_{i})^2 = |a_{j}|^2

since the value of j=1r(ajvi)2\sum_{j = 1}^{r}(a_{j}\cdot v_{i})^2 is the projection onto the subspace generated by the vectors v1,v2,,vrv_{1}, v_{2}, \dots, v_{r} it is just the value of the magnitude squared of the row aja_{j} itself. When we speak of the whole matrix itself, it becomes as

j=1r(Avi)2=j=1rσi2(A) \sum_{j = 1}^{r}(A\cdot v_{i})^2 = \sum_{j = 1}^{r}\sigma_{i}^2(A)

But j=1naj2=j=1nk=1dajk2\sum_{j=1}^{n} \|a_j\|^2 = \sum_{j=1}^{n} \sum_{k=1}^{d} a_{jk}^2, the sum of squares of all the entries of AA. Thus the sum of squares of the singular values of all the entries. There is an important norm associated with this quantity, the Frobenious norm of AA, denoted as AF||A||_{F} and defined as

AF=j=1nk=1dajk2 ||A||_{F} = \sqrt{\sum_{j = 1}^{n} \sum_{k = 1}^{d} a_{jk}^2}

Best rank-kk approximation

Left singular vectors the uiu_{i}

Power method for SVD

Power method is an iterative method which we can approximately find the singular values and singular vectors. Given a matrix AA, which can be either a square matrix or a rectangular matrix, we can find the singular values and singular vectors of AA by finding the eigen values and eigen vectors of ATAA^{T}A or AATAA^{T}.

Let us say B=ATAB = A^{T}A, where the dimensions of A is n×dn \times d. Then the dimensions of BB is d×dd \times d. Then we can find the eigen values and eigen vectors of BB using the power method. In the power method, we find the eigen values by the following steps:

  1. Take any random vector of size d×1d \times 1 and let it be x0x_{0}.
  2. Multiply x0x_{0} by BB and let it be x1x_{1}.
  3. Multiply x1x_{1} by BB and let it be x2x_{2}.
  4. Continue this process till pp times to produce xpx_{p}.

Now, the eigen value and eigen vectors of BB is given by the following:

Since the first singular value of AA or first eigen value of BB is the largest, we can find it by the following:

λ1=xpTBxpxpTxp\lambda_{1} = \frac{x_{p}^{T} B x_{p}}{x_{p}^{T} x_{p}}

The first eigen vector e1e_{1} is found out by

e1=xpxpTxpe_{1} = \frac{x_{p}}{\sqrt{x_{p}^{T} x_{p}}}

The denominator in finding e1e_{1} is to make the vector of unit length and not to overflow the values.

The proof of the power method is as follows:

We know that λ1>λ2>>λd|\lambda_{1}| > |\lambda_{2}| > \dots > |\lambda_{d}|. and the eigen vectors are e1,e2,,ede_{1}, e_{2}, \dots, e_{d}.

Let us take any random vector x0x_{0} and let it be a linear combination of the eigen vectors.

x0=i=1dcieix_{0} = \sum_{i = 1}^{d} c_{i}e_{i}

Now, we multiply x0x_{0} by BB and let it be x1x_{1}. Iterate the step until pp no. of times.

xp=Bpx0=Bpi=1dcieix_{p} = B^{p}x_{0} = B^{p}\sum_{i = 1}^{d} c_{i}e_{i}

This can further be decomposed into

xp=i=1dciBpeix_{p} = \sum_{i = 1}^{d} c_{i}B^{p}e_{i}

Since eie_{i} is the eigen vector of BB, we have Bei=λieiBe_{i} = \lambda_{i}e_{i}.

xp=i=1dciλipeix_{p} = \sum_{i = 1}^{d} c_{i}\lambda_{i}^{p}e_{i}

We can further decompose it into

xp=λ1p(c1e1+c2(λ2λ1)pe2++cd(λdλ1)ped)x_{p} = \lambda_{1}^{p} \left( c_{1}e_{1} + c_{2}\left(\frac{\lambda_{2}}{\lambda_{1}}\right)^{p}e_{2} + \dots + c_{d}\left(\frac{\lambda_{d}}{\lambda_{1}}\right)^{p}e_{d} \right)

Since we know that λ1>λ2>>λd|\lambda_{1}| > |\lambda_{2}| > \dots > |\lambda_{d}|, we have λiλ1<1\left|\frac{\lambda_{i}}{\lambda_{1}}\right| < 1 for i=2,3,,di = 2, 3, \dots, d. As pp \to \infty, we have λiλ1p0\left|\frac{\lambda_{i}}{\lambda_{1}}\right|^{p} \to 0 for i=2,3,,di = 2, 3, \dots, d. Therefore, as pp \to \infty, we have

xpλ1pc1e1x_{p} \to \lambda_{1}^{p} c_{1}e_{1}

Thus, the first eigen value and eigen vector of BB is given by the following:

λ1=xpTBxpxpTxp\lambda_{1} = \frac{x_{p}^{T} B x_{p}}{x_{p}^{T} x_{p}} e1=xpxpTxpe_{1} = \frac{x_{p}}{\sqrt{x_{p}^{T} x_{p}}}

⚠️ I should mention that I am not quite sure of how did we end up in the closed form solution of the eigen values and eigen vectors. I need to work on it.