Houjun Liu

singular value decomposition

Singular value decomposition is a factorization of a matrix which generalizes the eigendecomposition of normal matrices (i.e. \(A = V D V^{-1}\) when \(A\) is diagonalizable, which by the spectral theorem is guaranteed when \(A\) is normal).

Definitions

Singular value decomposition: every \(m \times n\) matrix \(M\) has a factorization of the form:

\begin{equation} M = U D^{\frac{1}{2}} V^{*} \end{equation}

where \(U\) is an \(m \times m\) unitary matrix; \(D^{\frac{1}{2}}\) is a diagonalish (i.e. rectangular diagonal) \(m \times n\) matrix whose diagonal entries, called singular values, are the non-negative square roots of the eigenvalues of \(M^{*}M\) (so the diagonal of \(D^{\frac{1}{2}}\) is \(\geq 0\)); and \(V\) is an \(n \times n\) unitary matrix whose columns form an orthonormal basis of eigenvectors of \(M^{*}M\).

The SVD is not technically unique, but we like to force a specific (convenient; see the proof for why) ordering: the diagonal of \(D^{\frac{1}{2}}\) (and the corresponding rows of \(V^{*}\)) is sorted so that the zero singular values come last.
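
As a sanity check of the statement, here is a minimal numerical sketch using numpy's built-in np.linalg.svd; the example matrix and variable names below are illustrative assumptions, not from the text.

```python
import numpy as np

# Minimal sketch: numpy's SVD returns exactly the pieces described above.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

U, s, Vh = np.linalg.svd(M)        # U: 4x4 unitary, s: singular values, Vh = V*
D_half = np.zeros((4, 3))          # the rectangular "diagonalish" matrix
np.fill_diagonal(D_half, s)

assert np.all(s >= 0)                          # singular values are non-negative
assert np.allclose(M, U @ D_half @ Vh)         # M = U D^{1/2} V*

# They are the square roots of the eigenvalues of M*M (eigvalsh sorts ascending):
eigvals = np.linalg.eigvalsh(M.conj().T @ M)
assert np.allclose(np.sqrt(np.abs(eigvals))[::-1], s)
```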

Doing It

Doing SVD is not actually super duper hard, but it takes some thinking about why it works, which we shall do below.

Recall that \(V^{*}\) is the conjugate transpose of \(V\), the matrix whose columns are orthonormal eigenvectors of \(M^{*} M\). So first, find an orthonormal eigenbasis of \(M^{*}M\); then, take the square roots of the corresponding eigenvalues and arrange them into \(D^{\frac{1}{2}}\).


Tangent:

Why is it that we can take square roots of these values (i.e. why are the eigenvalues guaranteed to be positive or zero)? Recall the definition of the adjoint:

\begin{equation} \langle Tv, w \rangle = \langle v, T^{*}w \rangle \end{equation}

Applying it here, we have

\begin{equation} \langle M^{*}M v, v \rangle = \langle M v, M v \rangle \end{equation}

And recall that, by definition of the inner product, \(\langle Mv, Mv \rangle = \|Mv\|^{2} \geq 0\). So if \(v\) is an eigenvector of \(M^{*}M\) with eigenvalue \(\lambda\), then \(\lambda \|v\|^{2} = \langle M^{*}M v, v \rangle = \|Mv\|^{2} \geq 0\), and since \(\|v\|^{2} > 0\), we get \(\lambda \geq 0\).


And so you can take the square roots of those eigenvalues of \(M^{*}M\) to obtain the singular values.

How do we get \(U\)? Well recall:

\begin{equation} M = U D^{\frac{1}{2}} V^{*} \end{equation}

And \(V\) is an operator whose columns form an orthonormal eigenbasis, so it is unitary, and so \(V = (V^{*})^{-1}\).

And therefore, multiplying both sides by \(V\) on the right:

\begin{equation} MV = UD^{\frac{1}{2}} \end{equation}

As \(D^{\frac{1}{2}}\) is diagonal, and we know the left side, we can then easily recover \(U\) by staring at it: each column of \(MV\) is the corresponding column of \(U\) scaled by a singular value, so we just normalize the nonzero columns (the columns corresponding to zero singular values get filled in later; see the proof).
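
Here is that recipe as a numpy sketch, under the simplifying assumption that \(M\) has full column rank (so every singular value is nonzero and we can divide by it); it recovers the "thin" \(U\), and the full square \(U\) is built in the proof below. The example matrix is an illustrative assumption.

```python
import numpy as np

# Sketch of the recipe above, assuming M has full column rank so that every
# singular value is nonzero.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))

# Orthonormal eigenvectors / eigenvalues of M*M (eigh returns them ascending).
eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]           # sort so the smallest/zero ones go last
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                    # singular values
D_half = np.diag(sigma)

# M V = U D^{1/2}, so each column of M V is sigma_j times a column of U;
# "norming the vectors" is just dividing each column by its singular value.
U = (M @ V) / sigma

assert np.allclose(M, U @ D_half @ V.T)
assert np.allclose(U.T @ U, np.eye(3))      # the recovered columns are orthonormal
```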

Motivation and Proof

Beginning Motivation

We have a matrix \(M\) of shape \(m \times n\), and it sucks: it may not be normal; it may not even be an operator (i.e. square).

So consider now:

\begin{equation} M^{*} M \end{equation}

you will note that this is now an operator (\((n \times m)(m \times n) = n \times n\))!! Not only that, \(M^{*}M\) is self-adjoint (\((M^{*}M)^{*} = M^{*}(M^{*})^{*} = M^{*}M\)). Of course self-adjoint matrices are normal, which is nice, so the spectral theorem applies here (even the real version, because self-adjoint!)
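
A quick numerical illustration of both facts (the matrix here is an arbitrary illustrative example):

```python
import numpy as np

# M*M is square (an operator) and self-adjoint, so eigh (the spectral theorem,
# numerically) applies and returns real eigenvalues.
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A = M.conj().T @ M                        # shape (3, 3)
assert np.allclose(A, A.conj().T)         # (M*M)* = M*M
eigvals = np.linalg.eigvalsh(A)           # real (and, per the tangent above, >= 0)
assert np.all(eigvals > -1e-10)
```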

Eigendecomposition of \(M^{*}M\)

So, by the spectral theorem, there is a basis of orthonormal eigenvectors \(v_1, \dots, v_{n}\) of \(M^{*}M\).

Given:

\begin{equation} V = \mqty(v_1 & \dots & v_{n}) \end{equation}

we have

\begin{equation} M^{*}M = V D_0 V^{-1} \end{equation}

i.e. this is the eigendecomposition (“similar to diagonal”) result we had from before, where \(D_0\) is a Diagonal Matrix with the eigenvalues on its diagonal.

Swapping the direction of conjugation, to expose the diagonal matrix by itself, we have:

\begin{equation} D_0 = V^{-1} M^{*} M V \end{equation}

You will NOTICE! The spectral theorem gives us that \(v_1, \dots, v_{n}\) is not only a basis of eigenvectors, but an ORTHONORMAL basis of eigenvectors. So \(V\) is an operator with orthonormal columns. And so, because of this result, we have that \(V^{*} = V^{-1}\).

Substituting this in, we have:

\begin{equation} D_0 = V^{*} M^{*} M V \end{equation}

Aside #1: zero-eigenvalue eigenvector ordering

To make this better, we can order \(v_1, \dots, v_{n}\) such that the eigenvectors corresponding to \(\lambda = 0\) come last.

And so we make a \(V\):

\begin{equation} V = \mqty(v_1 &\dots &v_{n-p} & v_{n-p+1} &\dots &v_{n}) \end{equation}

So we have two sub-matrices: a matrix \(V_1\) of shape \((n, n-p)\) which is filled by eigenvectors corresponding to nonzero eigenvalues, and another matrix \(V_2\) of shape \((n,p)\) which is made of eigenvectors corresponding to zero eigenvalues.

That is:

\begin{equation} \begin{cases} V_1 = \mqty(v_1 & \dots & v_{n-p}) \\ V_2 = \mqty(v_{n-p+1} & \dots & v_{n}) \\ \end{cases} \end{equation}

and

\begin{equation} V = \mqty(V_1 & V_2) \end{equation}

where \(v_1, \dots, v_{n-p}\) are orthonormal eigenvectors corresponding to non-zero eigenvalues, and \(v_{n-p+1}, \dots, v_{n}\) are those corresponding to the zero eigenvalue.

Furthermore, this ordering of the eigenvectors can help us better clarify what \(D_0\) is:

\begin{equation} D_0 = \mqty(D & 0 \\ 0 & 0) \end{equation}

Where \(D\) is a Diagonal Matrix with a strictly positive diagonal: the off-diagonal blocks are zero by definition of a diagonal matrix, and the lower-right block is \(0\) because the columns of \(V_2\) are eigenvectors corresponding to the zero eigenvalue.
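
A small numpy sketch of this ordering and splitting, on a deliberately rank-deficient example matrix (an illustrative assumption, not from the text):

```python
import numpy as np

# A rank-2, 4x3 matrix, so M*M has n = 3 eigenvalues of which p = 1 is zero.
rng = np.random.default_rng(3)
M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))

eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]              # nonzero eigenvalues first, zeros last
eigvals, V = eigvals[order], V[:, order]

nonzero = eigvals > 1e-10                      # numerical tolerance for "zero"
V1, V2 = V[:, nonzero], V[:, ~nonzero]         # shapes (3, 2) and (3, 1)
D = np.diag(eigvals[nonzero])                  # strictly positive diagonal

# D_0 = V* M* M V has the block form diag(D, 0):
D0 = V.T @ M.T @ M @ V
expected = np.block([[D, np.zeros((2, 1))],
                     [np.zeros((1, 2)), np.zeros((1, 1))]])
assert np.allclose(D0, expected, atol=1e-8)
```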

Applying \(V_1, V_2\) breakup from aside above

Ok, recall where we were:

\begin{equation} D_0 = V^{*} M^{*} M V \end{equation}

Applying the substitutions from above:

\begin{equation} \mqty(V_1^{*} \\ V_2^{*}) M^{*} M \mqty(V_1 & V_2) = \mqty(D & 0 \\ 0 & 0) \end{equation}

Now, recall how matrices multiply:

\begin{align} &\mqty(V_1^{*} \\ V_2^{*}) M^{*} M \mqty(V_1 & V_2) = \mqty(D & 0 \\ 0 & 0)\\ \Rightarrow\ &\mqty(V_1^{*} \\ V_2^{*}) \mqty(M^{*} M V_1 & M^{*} M V_2) = \mqty(D & 0 \\ 0 & 0) \\ \Rightarrow\ & \mqty(V_1^{*} M^{*} M V_1 & V_1^{*} M^{*} M V_2 \\ V_2^{*}M^{*} M V_1 & V_2^{*} M^{*} M V_2) = \mqty(D & 0 \\ 0 & 0) \end{align}

Aside #2: \(A^{*} A = 0 \implies A=0\)

Take the construction:

\begin{equation} A^{*} A = 0 \end{equation}

we desire that \(A = 0\).

Recall the definition of \(A^{*}\):

\begin{equation} \langle Av, w \rangle = \langle v, A^{*}w \rangle \end{equation}

for all \(v,w\).

Now, consider:

\begin{equation} \langle A^{*} Av, w \rangle = \langle A^{*} (Av), w \rangle = \langle Av, (A^{*})^{*}w \rangle = \langle Av, Aw \rangle \end{equation}

Applying the above, finally, consider:

\begin{equation} \|Av\|^{2} = \langle Av, Av \rangle = \langle A^{*}A v, v \rangle \end{equation}

Recall that \(A^{*}A = 0\), so:

\begin{equation} \|Av\|^{2} = \langle A^{*}A v, v \rangle = \langle 0v,v \rangle = 0 \end{equation}

So \(\|Av\| = 0\) for all \(v \in V\), which means \(A\) sends every vector to \(0\), which means \(A=0\), as desired.

Breaking \(V_{j}^{*} M^{*}M V_{j}\) up

Recall where we ended up at:

\begin{equation} \mqty(V_1^{*} M^{*} M V_1 & V_1^{*} M^{*} M V_2 \\ V_2^{*}M^{*} M V_1 & V_2^{*} M^{*} M V_2) = \mqty(D & 0 \\ 0 & 0) \end{equation}

Consider its diagonals:

\begin{equation} \begin{cases} V_1^{*} M^{*} M V_1 = D \\ V_2^{*} M^{*} M V_2 = 0 \end{cases} \end{equation}

Now, for the second expression, we have: \(V_2^{*}M^{*}MV_{2} = (M V_2)^{*} (M V_2) = 0\). So, from the result above (that \(A^{*}A = 0 \implies A=0\)), we have that \(MV_{2} = 0\).

Aside #3: \(V_1 V_1^{*} + V_2V_2^{*} = I\)

Consider:

\begin{equation} V_1 V_1^{*} \end{equation}

The matrix \(V_1\) has shape \((n, n-p)\), which makes \(V_1^{*}\) have shape \((n-p, n)\). You will, therefore, note that \(V_{1}^{*}\) is a map from a vector space of dimension \(n\) to one of dimension \(n-p\). This map, then, is not injective when \(p \neq 0\). Therefore, the overall operator \(V_1 V_1^{*}\) is also not injective: some non-zero vector is sent by \(V_1^{*}\) to \(0\), and then still by \(V_1\) to \(0\). This also means that \(V_1 V_1^{*}\) is not invertible.

Yet we are trying to show that \(V_1 V_1^{*} + V_2 V_2^{*}\), the sum of these two noninvertible maps, is \(I\): the granddaddy of all invertible maps. What gives?

Recall that:

\begin{equation} \begin{cases} \mqty(V_1 & V_2) = V \\ V V^{*} = I \end{cases} \end{equation}

The first result is by definition; the second holds because \(V\) is square with orthonormal columns, so it is unitary.

Let us begin with:

\begin{align} I &= V V^{*} \\ &= \mqty(V_1 & V_2) \mqty(V_1 & V_2)^{*} \\ &= \mqty(V_1 & V_2) \mqty(V_1^{*} \\ V_2^{*}) \\ &= V_1V_1^{*} + V_2 V_2^{*} \end{align}

And the last equation simply comes from how matrices multiply: row by column. And so, weirdly, we can confirm that adding two non-full-rank matrices can end up being the identity. So, again:

\begin{equation} V_1 V_1^{*} + V_2V_2^{*} = I \end{equation}
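A numerical sketch of this aside (the rank-deficient test matrix is an illustrative assumption): each of \(V_1 V_1^{*}\) and \(V_2 V_2^{*}\) is a non-invertible projection, yet they sum to the identity.

```python
import numpy as np

# Neither V1 V1* nor V2 V2* has full rank, but their sum is the identity.
rng = np.random.default_rng(4)
M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))   # rank 2, so p = 1

eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
V1, V2 = V[:, eigvals > 1e-10], V[:, eigvals <= 1e-10]

P1, P2 = V1 @ V1.T, V2 @ V2.T
assert np.linalg.matrix_rank(P1) == 2 and np.linalg.matrix_rank(P2) == 1
assert np.allclose(P1 + P2, np.eye(3))
```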

Constructing \(U_1\)

With the result above, we are finally close to doing what we want to do. Recall our last set of conclusions:

one, that:

\begin{equation} \begin{cases} V_1^{*} M^{*} M V_1 = D \\ V_2^{*} M^{*} M V_2 = 0 \end{cases} \end{equation}

and specifically, that \(MV_{2} = 0\).

and two, that:

\begin{align} &V_1 V_1^{* } + V_2V_2^{* } = I \\ \Rightarrow\ & V_1 V_1^{* } = I - V_2V_2^{* } \end{align}

Let’s now turn our attention to \(D\) above. Its diagonal entries are all non-zero, because we cropped out the zeros already (see above, during the definition of \(D\) vis-à-vis \(D_0\)). This means it is invertible, because an operator is invertible exactly when the diagonal of its upper-triangular matrix representation has no zeros. For a diagonal matrix, this is particularly easy; let us construct:

\begin{equation} D = D^{\frac{1}{2}} D^{\frac{1}{2}} \end{equation}

where \(D^{\frac{1}{2}}\) is just the same diagonal matrix as \(D\) except we take the square root of everything on the diagonal. The above can then be shown to be true by calculation (\(\sqrt{a}\sqrt{a} = a\) on every element of the diagonal).

Let us also make:

\begin{equation} I = D^{-\frac{1}{2}} D^{\frac{1}{2}} \end{equation}

where \(D^{-\frac{1}{2}}\) has \(\frac{1}{\sqrt{a}}\) on its diagonal for every element \(a\) on the diagonal of \(D\). Again, the above can be shown to be true by calculation, since \(\sqrt{a} \cdot \frac{1}{\sqrt{a}} = 1\).

Given that the diagonal of \(D\) contains the non-zero eigenvalues of \(M^{*}M\), by calculation \(D^{\frac{1}{2}}\) contains the square roots of these eigenvalues, which means it carries the (non-zero) singular values of \(M\) on its diagonal; this is rather nice (see the corollary below on how the singular values of \(M\) relate to its eigenvalues).

Consider, finally, the matrix \(M\):

\begin{align} M &= M - 0 \\ &= M - 0 V_2^{* } \\ &= M - (M V_2) V_2^{* } \\ &= M (I - V_2 V_2^{* }) \\ &= M V_1V_1^{*} \\ &= M V_1 I V_1^{*} \\ &= M V_1 (D^{-\frac{1}{2}} D^{\frac{1}{2}}) V_1^{*} \\ &= (M V_1 D^{-\frac{1}{2}}) D^{\frac{1}{2}} V_1^{*} \end{align}

We now define a matrix \(U_1\):

\begin{equation} U_1 = M V_1 D^{-\frac{1}{2}} \end{equation}

We now have:

\begin{equation} M = U_1 D^{\frac{1}{2}} V_1^{*} \end{equation}

Where \(U_1\) is a matrix of shape \((m \times n)(n \times (n-p))((n-p) \times (n-p)) = m \times (n-p)\), \(D^{\frac{1}{2}}\) is a diagonal matrix of shape \((n-p) \times (n-p)\) with singular values on the diagonal, and \(V_1^{*}\) is a matrix with orthonormal rows of shape \((n-p) \times n\).

This is a compact SVD. We are sandwiching a diagonal matrix of singular values between two rectangular matrices to recover \(M\). Ideally, we want the left and right matrices, too, to have nice properties (like, say, being an operator, or being unitary). So we work harder.
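
Here is the construction so far as a numpy sketch (again on an illustrative rank-deficient matrix); the last assert anticipates aside #4 below, that the columns of \(U_1\) come out orthonormal.

```python
import numpy as np

# Compact SVD: U1 = M V1 D^{-1/2}, so that M = U1 D^{1/2} V1*.
rng = np.random.default_rng(5)
M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))   # m = 4, n = 3, n - p = 2

eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
V1 = V[:, eigvals > 1e-10]                                      # (n, n-p) = (3, 2)

sigma = np.sqrt(eigvals[eigvals > 1e-10])
D_half, D_neg_half = np.diag(sigma), np.diag(1 / sigma)

U1 = M @ V1 @ D_neg_half                                        # (m, n-p) = (4, 2)

assert np.allclose(M, U1 @ D_half @ V1.T)       # the compact SVD reproduces M
assert np.allclose(U1.T @ U1, np.eye(2))        # columns of U1 are orthonormal (aside #4)
```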

Aside #4: \(U_1\) is orthonormal

We can’t actually claim \(U_1\) is unitary because it’s not an operator (it isn’t square). However, we would like to show that its columns are orthonormal, so that we can later extend it into a fully, actually unitary matrix.

One sign that a matrix \(T\) has orthonormal columns is \(T^{*}T = I\). Because of the way that matrices multiply, this holds iff the inner product of each column with itself is \(1\), and with every other column is \(0\). This is exactly the definition of orthonormality.

Therefore, we desire \(U_{1}^{*} U_1 = I\). We hence consider:

\begin{equation} U_1^{*} U_1 \end{equation}

We have by substitution of \(U_1 = M V_1 D^{-\frac{1}{2}}\):

\begin{equation} (M V_1 D^{-\frac{1}{2}})^{*}(M V_1 D^{-\frac{1}{2}}) \end{equation}

Given the property that \((AB)^{*} = B^{*}A^{*}\), we have that:

\begin{equation} (M V_1 D^{-\frac{1}{2}})^{*}(M V_1 D^{-\frac{1}{2}}) = {D^{-\frac{1}{2}}}^{*} V_1^{*} M^{*}M V_1 D^{-\frac{1}{2}} \end{equation}

Recall now that, from way before, we have:

\begin{equation} V_1^{*} M^{*} M V_1 = D \end{equation}

Substituting that in:

\begin{align} {D^{-\frac{1}{2}}}^{*} V_1^{*} M^{*}M V_1 D^{-\frac{1}{2}} &= {D^{-\frac{1}{2}}}^{*} (V_1^{*} M^{*}M V_1) D^{-\frac{1}{2}} \\ &= {D^{-\frac{1}{2}}}^{*} D D^{-\frac{1}{2}} \end{align}

Recall now that multiplication of diagonal matrices is commutative (by calculation), and that real diagonal matrices are self-adjoint (try conjugate-transposing a real diagonal matrix). We know that \(D^{-\frac{1}{2}}\) is real (because it’s filled with the reciprocals of the square roots of the eigenvalues of \(M^{*}M\), which is self-adjoint, and eigenvalues of self-adjoint matrices are real) and it is by definition diagonal. So we have that \(D^{-\frac{1}{2}}\) is self-adjoint.

Taking those facts in mind, we can now rewrite this expression:

\begin{align} {D^{-\frac{1}{2}}}^{*} V_1^{*} M^{*}M V_1 D^{-\frac{1}{2}} &= {D^{-\frac{1}{2}}}^{*} D D^{-\frac{1}{2}} \\ &= D^{-\frac{1}{2}} D D^{-\frac{1}{2}} \\ &= D^{-\frac{1}{2}}D^{-\frac{1}{2}} D \\ &= D^{-1} D \\ &= I \end{align}

Therefore, \(U_1^{*} U_1 = I\) as desired, so the columns of \(U_1\) are orthonormal.

SVD, fully

Recall that, so far, we have:

\begin{equation} M = U_1 D^{\frac{1}{2}} V_1^{*} \end{equation}

where

\begin{equation} U_1 = M V_1 D^{-\frac{1}{2}} \end{equation}

So far, \(U_1\) and \(V_1^{*}\) are both disappointingly not operators. However, we know that \(U_1\) and \(V_1\) both have orthonormal columns (the former per aside #4 above, the latter by the spectral theorem and the construction above). So wouldn’t it be doubleplusgood for both of them to be unitary operators?

To make this happen, we need to change the shapes of things a little without changing how the matrices behave. That is: we want \(U\) and \(V^{*}\) to both be operators, and yet still have \(U D^{\frac{1}{2}} V^{*} = M\).

Padding out \(D\) and \(V\)

There are immediate and direct ways of padding out \(D^{\frac{1}{2}}\) and \(V_{1}^{*}\): let us replace \(V_1\) with the full \(V\), and just shove enough zeros into \(D^{\frac{1}{2}}\) that the dimensions work out (changing it from shape \((n-p) \times (n-p)\) to \(m \times n\) by adding zeros on the edges until it works).

So first, \(D^{\frac{1}{2}}\) becomes:

\begin{equation} D^{\frac{1}{2}}_{new} = \mqty(D^{\frac{1}{2}}_{orig} & 0 &\dots \\ 0 & 0 &\dots \\ 0&0&\dots) \end{equation}

(the number of zeros on the edge vary based on the proportions of \(n, p, m\)).

Why would this always be padding? i.e. why is \(n-p \leq m\)? Here’s a hand-wavy proof that the reader can choose to fill in the gaps of: consider the fact that \(M\)’s shape is \(m \times n\). Specifically, this means that \(M: V \to W\) where \(\dim V = n\) and \(\dim W = m\). Say for the sake of argument \(n > m\) (otherwise naturally \(n-p \leq m\) because \(n \leq m\)). Consider \(null\ M\); given \(M\) is a map from a larger space to a smaller space, it has a non-trivial null space of dimension at least \(n-m\); and the null space of \(M^{*}M\) is at least that large as well, because everything is sent through \(M\) first. Applying rank-nullity then shows that \(\dim\ range\ M^{*}M \leq m\). Therefore \(n-p\), the number of non-zero eigenvalues of \(M^{*}M\) (which is also the rank of \(M^{*}M\), and the size of \(D\)), must be smaller than or equal to \(m\): otherwise the diagonal representation would have more linearly independent columns than the rank, which is impossible.

Now, we have

\begin{equation} V = \mqty(V_1 & V_2) \end{equation}

where \(V_1\) is the matrix whose columns are the eigenvectors corresponding to non-zero eigenvalues, and the columns of \(V_2\) are those corresponding to the zero eigenvalue.

Note, now that:

\(D^{\frac{1}{2}}_{new} V^{* }\) is an \(m \times n\) matrix that behaves almost exactly like \(D^{\frac{1}{2}}_{orig} V_1^{*}\), an \((n-p) \times n\) matrix. The last \(m-(n-p)\) (as we established before, \(m \geq n-p\)) coordinates of the new, padded matrix’s output will simply be \(0\). Recall that \(DT\), for a diagonal matrix \(D\), scales the rows of \(T\): the first \(n-p\) rows of \(V^{*}\) (which are the conjugate-transposed columns of \(V_1\)) get scaled by the singular values as before, while the remaining \(m-(n-p)\) rows of the product are \(0\) because they come from the padded zero part of the diagonal.

Padding out \(U\)

With \(D^{\frac{1}{2}}\) and \(V\) padded, it’s time to deal with \(U\). Fortunately, recall that the last bit of the output of \(D^{\frac{1}{2}} V^{*}\) will just be \(0\): so whatever we stick into the corresponding columns of \(U\) will never actually be added to the final output. In a sense, they don’t really matter.

For the first \(n-p\) columns of \(U\) (i.e. \(U_{1}\)) we already have a well-defined answer: recall from before that \(U_1 = M V_{1} D^{-\frac{1}{2}}\). So for the last bit we can just stick on literally whatever makes it work.

And by “making it work”, we literally just mean extending the columns of \(U_1\) until you have \(m\) linearly independent columns, then Gram-Schmidting to make them all orthonormal. The first \(n-p\) columns will not be affected by Gram-Schmidting, as we have established before that the columns of \(U_1\) are orthonormal.

Again, these are basically arbitrary: no one cares because these columns will always be scaled by \(0\) as they are part of the “padding columns” from padding \(D^{\frac{1}{2}}\) out.

and so, finally

Finally, we now have:

\begin{equation} M = U D^{\frac{1}{2}} V^{*} \end{equation}

where, \(U\) is an \(m \times m\) unitary operator, \(D^{\frac{1}{2}}\) is an \(m \times n\) semidiagonal matrix (diagonal \(n-p \times n-p\) part, then \(0\) padding all around) filled with singular values of \(M\) on the diagonal, and \(V^{*}\) is an \(n \times n\) unitary operator filled with orthonormal rows of right singular-vectors (i.e. eigenvectors of \(M^{ *}M\)).
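
And a final numpy sketch of the padding step; the test matrix, the tolerance, and the use of QR as a stand-in for the Gram-Schmidt completion are all illustrative assumptions.

```python
import numpy as np

# Full SVD by padding: extend U1 to an m x m orthonormal basis, pad D^{1/2} to
# m x n with zeros, and use the full n x n V.
rng = np.random.default_rng(6)
m, n = 4, 3
M = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))   # rank 2, so p = 1

eigvals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
V1 = V[:, eigvals > 1e-10]
r = V1.shape[1]                                   # r = n - p

sigma = np.sqrt(eigvals[:r])
U1 = M @ V1 @ np.diag(1 / sigma)                  # (m, r), orthonormal columns

D_half = np.zeros((m, n))                         # padded D^{1/2}
D_half[:r, :r] = np.diag(sigma)

# Complete U1 to an orthonormal basis of R^m: QR on [U1 | random columns] keeps
# span(U1) as the span of its first r columns; the trailing columns of Q are the
# (arbitrary) orthonormal padding we need.
Q, _ = np.linalg.qr(np.hstack([U1, rng.standard_normal((m, m - r))]))
U = np.hstack([U1, Q[:, r:]])

assert np.allclose(U.T @ U, np.eye(m))            # U is an m x m unitary (here orthogonal) operator
assert np.allclose(M, U @ D_half @ V.T)           # M = U D^{1/2} V*
```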

Useful corollaries

If \(\lambda\) is a non-negative real eigenvalue of \(M\), then \(\lambda\) is sometimes a singular value of \(M\)

Consider the matrix:

\begin{equation} \mqty(1 & 1 \\0 & 0) \end{equation}

Its eigenvalues are \(1\) and \(0\); but \(M^{*}M = \mqty(1 & 1 \\ 1 & 1)\) has eigenvalues \(2\) and \(0\), so the singular values of \(M\) are \(\sqrt{2}\) and \(0\), and the non-negative eigenvalue \(1\) is not among them. However, the statement does hold if \(M\) is normal (i.e. unitarily diagonalizable): writing \(M = V D V^{*}\) with \(V\) unitary, we have \(M^{*}M = V D^{*} D V^{*}\), so the diagonal representation of \(M^{*}M\) carries the eigenvalues of \(M\) with their absolute values squared. The singular values of \(M\) are then the absolute values of its eigenvalues, so any non-negative real eigenvalue is indeed a singular value.
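
In numbers, a quick numpy check of the counterexample:

```python
import numpy as np

# The counterexample above, numerically: eigenvalues 1 and 0, but singular
# values sqrt(2) and 0, so the eigenvalue 1 is not a singular value.
M = np.array([[1.0, 1.0],
              [0.0, 0.0]])
print(np.linalg.eigvals(M))                  # 1. and 0. (in some order)
print(np.linalg.svd(M, compute_uv=False))    # [1.41421356, 0.]
```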