assumptions of PCA
- this makes sense only when the structure in the data is (approximately) linear
- large variations carry important structure (i.e. they are not noise)
motivation
Consider a dataset \(\qty{x^{(i)}}_{i=1}^{n}\) such that \(x^{(i)} \in \mathbb{R}^{d}\), where \(d < n\). Our goal: we want to automatically identify correlations between the axes (features) of the data. This boils down to two goals:
- automatically detect and remove redundancies
- find features that explain the variation
preprocessing before PCA
normalize each feature to zero mean and unit variance: \(x_{j}^{(i)} \leftarrow \frac{x^{(i)}_{j} - \mu_{j}}{\sigma_{j}}\), where \(\mu_{j}\) and \(\sigma_{j}\) are the mean and standard deviation of feature \(j\) over the dataset
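A minimal numpy sketch of this step (assuming the data sits in an \(n \times d\) array `X` with one example per row; the function name is mine, not a standard routine):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)            # per-feature means mu_j
    sigma = X.std(axis=0)          # per-feature standard deviations sigma_j
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma
```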
derivation
Find a direction \(u\) such that, if the data is projected onto \(u\), the variance of the projected data is maximized. Namely, for a unit vector \(u\) representing this direction, we want to find \(u\) that maximizes:
\begin{align} \frac{1}{n} \sum_{i=1}^{n} \qty({x^{(i)}}^{T} u)^{2} &= \frac{1}{n} \sum_{i=1}^{n} u^{T} x^{(i)} {x^{(i)}}^{T} u \\ &= u^{T} \qty(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}) u \end{align}
Notice that the maximizing choice of \(u\) is the principal eigenvector of \(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\) (it is the vector that the matrix does not rotate but only stretches). To show this for yourself, use the method of Lagrange multipliers.
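Filling in that step: writing \(\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\) and enforcing the unit-norm constraint \(u^{T}u = 1\) with a multiplier \(\lambda\),
\begin{align} \mathcal{L}(u, \lambda) &= u^{T} \Sigma u - \lambda \qty(u^{T}u - 1) \\ \nabla_{u} \mathcal{L} &= 2\Sigma u - 2\lambda u = 0 \implies \Sigma u = \lambda u \end{align}
so every stationary point is an eigenvector of \(\Sigma\), and the objective \(u^{T} \Sigma u = \lambda u^{T} u = \lambda\) is largest for the eigenvector with the largest eigenvalue.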
Once you do this, you can then project:
\begin{equation} x \approx \qty(x^{T}u )u \end{equation}
you can keep more than one dimension by choosing the eigenvectors \(u_{1}, \dots, u_{k}\) corresponding to successively smaller eigenvalues:
\begin{equation} x \approx \sum_{i=1}^{k} \qty(x^{T}u_{i}) u_{i} \end{equation}
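A short numpy sketch of this recipe (names like `pca_eig` are made up for illustration; it assumes `X` is the \(n \times d\) matrix of normalized examples, one per row):

```python
import numpy as np

def pca_eig(X, k):
    """Top-k PCA via eigendecomposition of (1/n) * sum_i x_i x_i^T."""
    n = X.shape[0]
    C = (X.T @ X) / n                     # (1/n) sum_i x_i x_i^T, a d x d matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh since C is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]     # re-sort into decreasing order
    U = eigvecs[:, order[:k]]             # columns u_1, ..., u_k
    X_hat = (X @ U) @ U.T                 # reconstruction: sum_j (x^T u_j) u_j for every row
    return U, X_hat
```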
connecting SVD and PCA
Recall in PCA we obtain the matrix:
\begin{equation} \qty(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}) \end{equation}
for which we need the eigenvectors. Notice that if we stack the \(x^{(i)}\) as the columns of a matrix \(X\), then \(XX^{T} = \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\), and via the SVD we can write:
\begin{equation} X = U S V^{T} \end{equation}
Its transpose is:
\begin{equation} X^{T} = V S^{T} U^{T} \end{equation}
Since \(V^{T}V = I\), this gives:
\begin{align} X X^{T} &= U S V^{T} V S^{T} U^{T} \\ &= U S S^{T} U^{T} \\ &= \qty(US) \qty(US)^{T} \end{align}
This shows that the columns of \(U\) are the eigenvectors of \(XX^{T}\), which is nice because we can get the PCA directions straight from the SVD of \(X\).
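For instance, a sketch of reading the directions off the SVD instead of forming \(XX^{T}\) (here `X_cols` is the \(d \times n\) matrix with one example per column, matching the derivation above; the function name is my own):

```python
import numpy as np

def pca_svd(X_cols, k):
    """Top-k PCA directions from the SVD of the d x n data matrix."""
    U, s, Vt = np.linalg.svd(X_cols, full_matrices=False)
    directions = U[:, :k]                       # left singular vectors = eigenvectors of X X^T
    variances = s[:k] ** 2 / X_cols.shape[1]    # eigenvalues of (1/n) X X^T
    return directions, variances
```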
error
We can measure the reconstruction error of PCA as:
\begin{equation} x^{(i)} - \sum_{j=1}^{k} \qty({x^{(i)}}^{T}u_{j}) u_{j} \end{equation}
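As a small sketch, the average squared version of this error can be computed directly (reusing the row-per-example convention; `U` holds \(u_{1}, \dots, u_{k}\) as columns):

```python
import numpy as np

def reconstruction_error(X, U):
    """Mean of ||x^(i) - sum_j (x^(i)^T u_j) u_j||^2 over the rows of X."""
    X_hat = (X @ U) @ U.T                    # project onto span(u_1..u_k) and map back
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))
```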
applications
- visualization
- preprocessing to reduce the number of dimensions
- reduce noise in the data (by getting rid of the extra, low-variance axes)
some recent work
Candes, Li, Ma, Wright: “Robust PCA” — instead of directly minimizing \(\norm{X - L}\) such that \(\text{rank}\qty(L) \leq k\) (which is non-convex), decompose \(X = L + S\) and minimize the convex surrogate \(\norm{L}_{*} + \lambda \norm{S}_{1}\), where \(\norm{L}_{*}\) is the nuclear norm (the sum of the singular values of \(L\)) and \(\norm{S}_{1}\) is the entrywise \(\ell_{1}\) norm of the sparse part \(S\)
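As a rough sketch (not the authors' reference code), this convex program, minimize \(\norm{L}_{*} + \lambda \norm{S}_{1}\) subject to \(L + S = X\), can be attacked with an ADMM / inexact augmented Lagrangian loop; the parameter defaults below are common heuristics I am assuming, not values from these notes:

```python
import numpy as np

def robust_pca(X, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Sketch of principal component pursuit: X ~ L (low rank) + S (sparse)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))            # common default weight
    mu = mu if mu is not None else m * n / (4.0 * np.abs(X).sum() + 1e-12)  # heuristic penalty
    soft = lambda A, t: np.sign(A) * np.maximum(np.abs(A) - t, 0.0)       # soft-thresholding
    L = np.zeros(X.shape)
    S = np.zeros(X.shape)
    Y = np.zeros(X.shape)                                                 # dual variable
    for _ in range(max_iter):
        # L-step: singular value thresholding of X - S + Y/mu
        U, s, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * soft(s, 1.0 / mu)) @ Vt
        # S-step: entrywise soft-thresholding
        S = soft(X - L + Y / mu, lam / mu)
        # dual ascent on the constraint L + S = X
        Y = Y + mu * (X - L - S)
        if np.linalg.norm(X - L - S) <= tol * np.linalg.norm(X):
            break
    return L, S
```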
