Singular Value Decomposition

Singular Value Decomposition (SVD)

Before talking about SVD, let us discuss why we wanted a decomposition like this in the first place.

In linear algebra, people have some high beauty standards, and chief among them is the love for square matrices. Square matrices were something like demigods, because they have so many good properties. But due to this extreme love for one kind of matrix, the rectangular matrices were left behind.

Let us discuss one of the main things that people wanted to be able to do even with rectangular matrices: diagonalization. For a diagonalizable square matrix, this is a decomposition into 3 different matrices, with $D$ diagonal and $P$ invertible:

$A = PDP^{-1}$

This kind of decomposition helps in a lot of ways. One example is computing powers, $A^{n} = PD^{n}P^{-1}$; there are many other uses, which are outside the scope of this text.
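As a quick illustration of the power formula, here is a minimal NumPy sketch. The $2 \times 2$ matrix and the exponent are made-up values chosen only so that the matrix is diagonalizable with real eigen values; nothing here comes from the original text.

```python
import numpy as np

# A small diagonalizable matrix (illustrative values only).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of P are eigen vectors of A; D carries the eigen values.
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)

n = 5
# Raising D to the n-th power only needs its diagonal entries.
A_pow_via_diag = P @ np.diag(eigvals ** n) @ np.linalg.inv(P)
A_pow_direct = np.linalg.matrix_power(A, n)

print(np.allclose(P @ D @ np.linalg.inv(P), A))    # A = P D P^{-1}
print(np.allclose(A_pow_via_diag, A_pow_direct))   # A^n = P D^n P^{-1}
```

Both checks compare against a direct computation, so they only confirm the identity on this one example.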

So the idea of SVD came about because diagonalization is only available for square matrices. For rectangular matrices we can't do this kind of decomposition, hence we sought a different technique.

Instead of decomposing the matrix $A$ directly, we work with the square matrix $A^{T}A$ and find things out from it. Even finding an inverse for a rectangular matrix uses this technique, and the result is called a pseudo-inverse. So, let us deal with the new thing.
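To make the role of $A^{T}A$ concrete, here is a minimal NumPy sketch of the pseudo-inverse of a tall matrix. It assumes the columns of $A$ are linearly independent, so that $A^{T}A$ is invertible and $(A^{T}A)^{-1}A^{T}$ is well defined; the example matrix is made up for illustration.

```python
import numpy as np

# A tall 3 x 2 matrix with independent columns (made-up numbers).
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# With independent columns, A^T A is square and invertible, so
# (A^T A)^{-1} A^T acts as a left inverse of the rectangular A.
A_pinv = np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(A_pinv @ A, np.eye(2)))        # left-inverse check
print(np.allclose(A_pinv, np.linalg.pinv(A)))    # compare with NumPy's pinv
```

When the columns are not independent this formula breaks down, and `np.linalg.pinv` (which is built on the SVD) is the safer route.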

Before going into the SVD, I shall write down some of the questions I had while reading the text on SVD in the book Foundations of Data Science by Blum, Hopcroft, and Kannan.

Suppose the rows of a matrix live in $d$ dimensions but its rank is only $k$, with $k < d$. What is the best subspace $V$ that the matrix can be shrunk to?

To know the answer, we shall first go through the situations where this is helpful in real life. Let us take the example of machine learning. In this field, we typically deal with huge datasets whose rank is often much lower than the full rank of the matrix. In these situations, we prefer to fit the matrix into a subspace of lower dimension, which eventually helps us do the math over it. Great, it seems well enough. Then how?

The best fit subspace

Given a matrix $A$ for which we want to find the best fit subspace, we write the decomposition as follows (calm down, I shall dissect each piece).

$A = UDV^{T}$

Let us say that the columns of $U$ are $u_i$ and the columns of $V$ are $v_i$. The matrix $D$ is diagonal, with its entries on the principal diagonal. We define each of these vectors soon.

Let us say we have got some "kind of" eigen values and eigen vectors for the matrix $A^{T}A$. Then we proceed as follows.

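\begin{align} A^{T}A v_{i} &= \sigma_{i}^2 v_{i} \\ \sigma_{i} &= \sqrt{\lambda_{i}(A^{T}A)} \end{align}

Here $\lambda_{i}(A^{T}A)$ is the $i$-th eigen value of $A^{T}A$, which is never negative. In the special case where $A$ is square and symmetric, $A^{T}A = A^2$ has eigen values $\lambda_{i}(A)^2$, so $\sigma_{i} = \sqrt{\lambda_{i}(A)^2} = |\lambda_{i}(A)|$. A rectangular $A$ has no eigen values of its own, so this last step does not carry over to the general case.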

We call this $\sigma_{i}$ a singular value of the matrix $A$ and the corresponding $v$ a (right) singular vector. The main difference between a singular value and an eigen value is that a singular value is never negative, while an eigen value can be positive or negative (or even complex). Now we come back to the point of finding the best fit subspace. The vectors $v_{i}$ are orthonormal vectors that span the subspace $V$, and we need to find them. Initially, we shall go by the basic method.
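The relation between singular values of $A$ and eigen values of $A^{T}A$ can be checked numerically. The sketch below uses a random $6 \times 3$ matrix (an arbitrary choice, with a fixed seed for repeatability) and compares `np.linalg.svd` against the square roots of the eigen values returned by `np.linalg.eigvalsh`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))    # a rectangular 6 x 3 matrix

# Eigen values of the symmetric matrix A^T A.
# eigvalsh returns them in ascending order, so reverse to descending.
lam = np.linalg.eigvalsh(A.T @ A)[::-1]

# Singular values of A, returned in descending order.
sigma = np.linalg.svd(A, compute_uv=False)

print(np.allclose(sigma, np.sqrt(lam)))   # sigma_i = sqrt(lambda_i(A^T A))
print(np.all(sigma >= 0))                 # singular values are never negative
```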

$\underline{\text{The Greedy Algorithm}}$

Let us say that we have $n$ points in $d$-dimensional space and the points are the rows of the matrix $A$. The matrix is usually of low rank when we speak about these datasets. So, we plot them in the $d$-dimensional space and look for the single unit vector $v$ whose line through the origin has the least total squared perpendicular distance to the points or, equivalently, the maximum total squared projection of the points onto it. We shall call it the top singular vector. How shall we figure it out? It is the vector

$v_{1} = \arg\max_{\|v\| = 1} \|A v\|$

We call this $v_{1}$ the top singular vector, and the length of $Av_{1}$ is the top singular value of $A$.

$\sigma_{1}(A) = \|Av_{1}\|$
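A small experiment can make the "maximum over unit vectors" idea tangible. The sketch below takes $v_{1}$ from `np.linalg.svd` and compares $\|Av_{1}\|$ with $\|Av\|$ for many random unit vectors. The data matrix, its size, and the number of random trials are arbitrary assumptions for illustration; a random search like this can only fail to beat $v_{1}$, it does not prove optimality.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))      # 50 points in 4 dimensions (rows of A)

# Rows of Vt are the right singular vectors; the first one is v_1.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]
sigma1 = np.linalg.norm(A @ v1)       # length of A v_1

# Many random unit vectors, one per row.
V = rng.standard_normal((1000, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)
lengths = np.linalg.norm(A @ V.T, axis=0)   # ||A v|| for each random v

print(np.isclose(sigma1, s[0]))       # ||A v_1|| is the top singular value
print(sigma1 >= lengths.max())        # no random direction does better
```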

That would be enough if all the points lay on that line, but generally this is not the case: some of the points are not aligned with the line, and we then need a $2$-dimensional subspace. The question is how to find such a subspace. We look for a unit vector $v_{2}$ perpendicular to $v_{1}$ that again maximizes $\|Av\|$; the subspace is then spanned by $v_{1}$ and $v_{2}$. If all the points now lie on this plane, the perpendicular distance from each point to the subspace is zero. How do we know when to stop iterating? We keep finding a new vector perpendicular to all the previous ones and checking the projection of the points onto it. If the largest achievable projection is zero, i.e. $\|Av\| = 0$ for every $v$ perpendicular to the vectors found so far, we stop and no more new vectors are added.
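The greedy procedure can be sketched in NumPy as follows. This is one possible way to carry out each maximization, not necessarily how the book does it: it assumes that maximizing $\|Av\|$ over unit vectors perpendicular to the ones already found is the same as taking the top eigen vector of $A^{T}A$ restricted to that perpendicular space. The function name `greedy_singular_vectors` and the tolerance `tol` are hypothetical choices made for this example.

```python
import numpy as np

def greedy_singular_vectors(A, tol=1e-12):
    """Greedy sketch: repeatedly pick the unit vector, perpendicular to all
    previously chosen ones, that maximizes ||A v||; stop when that maximum
    is (numerically) zero."""
    d = A.shape[1]
    total = np.sum(A ** 2)             # scale used for the stopping test
    P = np.eye(d)                      # projector onto the space still allowed
    sigmas, vs = [], []
    while len(vs) < d:
        M = P @ A.T @ A @ P            # A^T A restricted to the allowed space
        w, Q = np.linalg.eigh(M)       # eigen values in ascending order
        if w[-1] <= tol * total:       # largest achievable ||A v||^2 is ~0
            break
        v = Q[:, -1]                   # maximizer of ||A v|| in that space
        sigmas.append(np.sqrt(w[-1]))
        vs.append(v)
        P = P - np.outer(v, v)         # forbid the direction just found
    return np.array(sigmas), np.array(vs)

# Usage: 20 points in 4 dimensions that actually lie in a 2-dimensional subspace.
rng = np.random.default_rng(3)
A = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 4))
sigmas, vs = greedy_singular_vectors(A)
print(len(sigmas))                     # number of vectors found before stopping
print(np.allclose(sigmas, np.linalg.svd(A, compute_uv=False)[:len(sigmas)]))
```

The stopping test is relative to the total squared size of $A$, because in floating point the "zero projection" condition is only ever met approximately.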

$\underline{\text{Frobenius Norm}}$

Before defining the Frobenius norm, let us do some calculations. Let us say each row of the matrix $A$ is $a_{j}$ and the orthonormal singular vectors that span all the rows of $A$ are $v_{1}, v_{2}, \dots, v_{r}$. Then

$\sum_{i = 1}^{r}(a_{j}\cdot v_{i})^2 = \|a_{j}\|^2$

since $\sum_{i = 1}^{r}(a_{j}\cdot v_{i})^2$ is the squared length of the projection of $a_{j}$ onto the subspace spanned by $v_{1}, v_{2}, \dots, v_{r}$, and $a_{j}$ already lies in that subspace, it is just the squared magnitude of the row $a_{j}$ itself. Summing over all the rows of the matrix, this becomes

$\sum_{j = 1}^{n}\|a_{j}\|^2 = \sum_{j = 1}^{n}\sum_{i = 1}^{r}(a_{j}\cdot v_{i})^2 = \sum_{i = 1}^{r}\|Av_{i}\|^2 = \sum_{i = 1}^{r}\sigma_{i}^2(A)$

But $\sum_{j=1}^{n} \|a_j\|^2 = \sum_{j=1}^{n} \sum_{k=1}^{d} a_{jk}^2$, the sum of squares of all the entries of $A$. Thus the sum of squares of the singular values of $A$ equals the sum of squares of all its entries. There is an important norm associated with this quantity, the Frobenius norm of $A$, denoted $\|A\|_{F}$ and defined as

$\|A\|_{F} = \sqrt{\sum_{j = 1}^{n} \sum_{k = 1}^{d} a_{jk}^2}$
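The identity between the Frobenius norm and the singular values is easy to check numerically. This sketch uses an arbitrary random $5 \times 3$ matrix and compares three quantities: the sum of squared entries, the sum of squared singular values, and the square of NumPy's built-in Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

sigma = np.linalg.svd(A, compute_uv=False)

sum_sq_entries = np.sum(A ** 2)        # sum of squares of all entries
sum_sq_sigma = np.sum(sigma ** 2)      # sum of squares of singular values
fro = np.linalg.norm(A, 'fro')         # NumPy's Frobenius norm

print(np.isclose(sum_sq_entries, sum_sq_sigma))
print(np.isclose(fro ** 2, sum_sq_entries))
```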

Best rank- $k$ approximation

Left singular vectors: the $u_{i}$

Power method for SVD

The power method is an iterative method with which we can approximately find the singular values and singular vectors. Given a matrix $A$, which can be either square or rectangular, we can find the singular values and singular vectors of $A$ by finding the eigen values and eigen vectors of $A^{T}A$ or $AA^{T}$.

Let us say $B = A^{T}A$, where the dimensions of $A$ are $n \times d$. Then the dimensions of $B$ are $d \times d$, and we can find the eigen values and eigen vectors of $B$ using the power method. The power method proceeds by the following steps:

1. Take any random vector of size $d \times 1$ and let it be $x_{0}$.
2. Multiply $x_{0}$ by $B$ and let it be $x_{1}$.
3. Multiply $x_{1}$ by $B$ and let it be $x_{2}$.
4. Continue this process $p$ times to produce $x_{p}$.

Now the largest eigen value of $B$ and its eigen vector are estimated from $x_{p}$. Since the first eigen value of $B$ (which is the square of the first singular value of $A$) is the largest, we can find it by the following:

$\lambda_{1} = \frac{x_{p}^{T} B x_{p}}{x_{p}^{T} x_{p}}$

The first eigen vector $e_{1}$ is found out by

$e_{1} = \frac{x_{p}}{\sqrt{x_{p}^{T} x_{p}}}$

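The denominator in the formula for $e_{1}$ makes the vector of unit length. Note that normalizing only at the end does not stop the entries of $x_{p}$ from overflowing for large $p$; rescaling $x$ after every multiplication avoids that without changing its direction.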
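The steps and the two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a production routine: the function name `power_method`, the default `p=200`, and the fixed random seed are assumptions made for the example, and the vector is rescaled after every multiplication to keep the numbers from overflowing.

```python
import numpy as np

def power_method(A, p=200, seed=0):
    """Estimate the largest eigen value of B = A^T A and its eigen vector
    by repeatedly multiplying a random start vector by B."""
    rng = np.random.default_rng(seed)
    B = A.T @ A                          # d x d
    x = rng.standard_normal(B.shape[0])  # random start x_0
    for _ in range(p):
        x = B @ x                        # x_{k+1} = B x_k
        x /= np.linalg.norm(x)           # rescale; the direction is unchanged
    lam1 = (x @ B @ x) / (x @ x)         # eigen value estimate from x_p
    e1 = x / np.sqrt(x @ x)              # unit-length eigen vector estimate
    return lam1, e1

# Usage on a 3 x 2 matrix whose singular values are 3 and 2 by construction.
A = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
lam1, e1 = power_method(A)
print(lam1, np.sqrt(lam1))   # estimates of lambda_1(B) and sigma_1(A)
print(e1)                    # estimate of the top right singular vector (up to sign)
```

How many iterations are needed depends on how well separated the two largest eigen values of $B$ are; a fixed `p` is used here only to keep the sketch short.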

The proof of the power method is as follows:

We assume that $|\lambda_{1}| > |\lambda_{2}| > \dots > |\lambda_{d}|$ and that the corresponding eigen vectors are $e_{1}, e_{2}, \dots, e_{d}$, which can be taken orthonormal because $B$ is symmetric.

Let us take any random vector $x_{0}$ and write it as a linear combination of the eigen vectors, with $c_{1} \neq 0$ (a random start gives this almost surely).

$x_{0} = \sum_{i = 1}^{d} c_{i}e_{i}$

Now, we multiply $x_{0}$ by $B$ to get $x_{1}$, and iterate this step $p$ times.

$x_{p} = B^{p}x_{0} = B^{p}\sum_{i = 1}^{d} c_{i}e_{i}$

This can further be decomposed into

$x_{p} = \sum_{i = 1}^{d} c_{i}B^{p}e_{i}$

Since $e_{i}$ is the eigen vector of $B$ , we have $Be_{i} = \lambda_{i}e_{i}$ .

$x_{p} = \sum_{i = 1}^{d} c_{i}\lambda_{i}^{p}e_{i}$

We can further decompose it into

$x_{p} = \lambda_{1}^{p} \left( c_{1}e_{1} + c_{2}\left(\frac{\lambda_{2}}{\lambda_{1}}\right)^{p}e_{2} + \dots + c_{d}\left(\frac{\lambda_{d}}{\lambda_{1}}\right)^{p}e_{d} \right)$

Since $|\lambda_{1}| > |\lambda_{2}| > \dots > |\lambda_{d}|$, we have $\left|\frac{\lambda_{i}}{\lambda_{1}}\right| < 1$ for $i = 2, 3, \dots, d$. As $p \to \infty$, we have $\left|\frac{\lambda_{i}}{\lambda_{1}}\right|^{p} \to 0$ for $i = 2, 3, \dots, d$. Therefore, for large $p$, every term except the first becomes negligible and we have

$x_{p} \approx \lambda_{1}^{p} c_{1}e_{1}$

Thus, the first eigen value and eigen vector of $B$ are given by the following:

$\lambda_{1} = \frac{x_{p}^{T} B x_{p}}{x_{p}^{T} x_{p}}$

$e_{1} = \frac{x_{p}}{\sqrt{x_{p}^{T} x_{p}}}$

⚠️ I should mention that I am not quite sure how we ended up with the closed-form expressions for the eigen value and eigen vector. I need to work on it.
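One possible way to fill this gap, offered as a sketch rather than the book's own argument, is to plug the large-$p$ form of $x_{p}$ from the proof into the two formulas. This assumes the $e_{i}$ are orthonormal (they can be chosen so because $B = A^{T}A$ is symmetric), that $c_{1} \neq 0$, and it uses $Be_{1} = \lambda_{1}e_{1}$ and $e_{1}^{T}e_{1} = 1$.

\begin{align} x_{p}^{T} B x_{p} &\approx (\lambda_{1}^{p} c_{1})^2 \, e_{1}^{T} B e_{1} = \lambda_{1}^{2p} c_{1}^2 \, \lambda_{1} \\ x_{p}^{T} x_{p} &\approx (\lambda_{1}^{p} c_{1})^2 \, e_{1}^{T} e_{1} = \lambda_{1}^{2p} c_{1}^2 \\ \frac{x_{p}^{T} B x_{p}}{x_{p}^{T} x_{p}} &\approx \lambda_{1} \\ \frac{x_{p}}{\sqrt{x_{p}^{T} x_{p}}} &\approx \frac{\lambda_{1}^{p} c_{1} e_{1}}{|\lambda_{1}^{p} c_{1}|} = \pm e_{1} \end{align}

So the ratio cancels the unknown factor $\lambda_{1}^{2p} c_{1}^2$ and leaves $\lambda_{1}$, while dividing $x_{p}$ by its length leaves $e_{1}$ up to sign, which does not matter for an eigen vector.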