assumptions of PCA
- this makes sense only when the structure in the data is (approximately) linear
- large variations carry important structure (i.e. they are not noise)
motivation
Consider a dataset \(\qty{x^{(i)}}_{i=1}^{n}\) such that \(x^{(i)} \in \mathbb{R}^{d}\), where \(d < n\). Our goal: we want to automatically identify correlations between the axes (features) of the data. This boils down to two goals:
- automatically detect and remove redundancies
- find features that explain the variation
preprocessing before PCA
normalize each feature to zero mean and unit variance: \(x_{j}^{(i)} \leftarrow \frac{x^{(i)}_{j} - \mu_{j}}{\sigma_{j}}\), where \(\mu_{j}\) and \(\sigma_{j}\) are the mean and standard deviation of feature \(j\) over the dataset
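A minimal numpy sketch of this step (assuming the data sits in an \(n \times d\) array `X` with one example per row; the function name is mine, not a standard routine):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)            # per-feature means mu_j
    sigma = X.std(axis=0)          # per-feature standard deviations sigma_j
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma
```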
derivation
Find a direction \(u\) such that, if the data is projected onto \(u\), the variance of the projected data is maximized. Namely, for a unit vector \(u\) representing this direction, we want to find \(u\) that maximizes:
\begin{align} \frac{1}{n} \sum_{i=1}^{n} \qty({x^{(i)}}^{T} u)^{2} &= \frac{1}{n} \sum_{i=1}^{n} u^{T} x^{(i)} {x^{(i)}}^{T} u \\ &= u^{T} \qty(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}) u \end{align}
Notice that the maximizing choice of \(u\) is the principal eigenvector of \(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\) (it is the vector that the matrix does not rotate but only stretches). To show this for yourself, use the method of Lagrange multipliers.
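Filling in that step: writing \(\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\) and enforcing the unit-norm constraint \(u^{T}u = 1\) with a multiplier \(\lambda\),
\begin{align} \mathcal{L}(u, \lambda) &= u^{T} \Sigma u - \lambda \qty(u^{T}u - 1) \\ \nabla_{u} \mathcal{L} &= 2\Sigma u - 2\lambda u = 0 \implies \Sigma u = \lambda u \end{align}
so every stationary point is an eigenvector of \(\Sigma\), and the objective \(u^{T} \Sigma u = \lambda u^{T} u = \lambda\) is largest for the eigenvector with the largest eigenvalue.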
Once you do this, you can then project:
\begin{equation} x \approx \qty(x^{T}u )u \end{equation}
you can keep more than one dimension by choosing the eigenvectors \(u_{1}, \dots, u_{k}\) corresponding to successively smaller eigenvalues:
\begin{equation} x \approx \sum_{i=1}^{k} \qty(x^{T}u_{i}) u_{i} \end{equation}
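A short numpy sketch of this recipe (names like `pca_eig` are made up for illustration; it assumes `X` is the \(n \times d\) matrix of normalized examples, one per row):

```python
import numpy as np

def pca_eig(X, k):
    """Top-k PCA via eigendecomposition of (1/n) * sum_i x_i x_i^T."""
    n = X.shape[0]
    C = (X.T @ X) / n                     # (1/n) sum_i x_i x_i^T, a d x d matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh since C is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]     # re-sort into decreasing order
    U = eigvecs[:, order[:k]]             # columns u_1, ..., u_k
    X_hat = (X @ U) @ U.T                 # reconstruction: sum_j (x^T u_j) u_j for every row
    return U, X_hat
```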
connecting SVD and PCA
Recall in PCA we obtain the matrix:
\begin{equation} \qty(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}) \end{equation}
for which we need the eigenvectors. Notice that if we stack the \(x^{(i)}\) as the columns of a matrix \(X\), then \(XX^{T} = \sum_{i=1}^{n} x^{(i)} {x^{(i)}}^{T}\), and via the SVD we can write:
\begin{equation} X = U S V^{T} \end{equation}
Its transpose is:
\begin{equation} X^{T} = V S^{T} U^{T} \end{equation}
Since \(V^{T}V = I\), this gives:
\begin{align} X X^{T} &= U S V^{T} V S^{T} U^{T} \\ &= U S S^{T} U^{T} \\ &= \qty(US) \qty(US)^{T} \end{align}
This shows that the columns of \(U\) are the eigenvectors of \(XX^{T}\), which is nice because we can get the PCA directions straight from the SVD of \(X\).
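For instance, a sketch of reading the directions off the SVD instead of forming \(XX^{T}\) (here `X_cols` is the \(d \times n\) matrix with one example per column, matching the derivation above; the function name is my own):

```python
import numpy as np

def pca_svd(X_cols, k):
    """Top-k PCA directions from the SVD of the d x n data matrix."""
    U, s, Vt = np.linalg.svd(X_cols, full_matrices=False)
    directions = U[:, :k]                       # left singular vectors = eigenvectors of X X^T
    variances = s[:k] ** 2 / X_cols.shape[1]    # eigenvalues of (1/n) X X^T
    return directions, variances
```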
error
We can measure the reconstruction error of PCA as:
\begin{equation} x^{(i)} - \sum_{j=1}^{k} \qty({x^{(i)}}^{T}u_{j}) u_{j} \end{equation}
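As a small sketch, the average squared version of this error can be computed directly (reusing the row-per-example convention; `U` holds \(u_{1}, \dots, u_{k}\) as columns):

```python
import numpy as np

def reconstruction_error(X, U):
    """Mean of ||x^(i) - sum_j (x^(i)^T u_j) u_j||^2 over the rows of X."""
    X_hat = (X @ U) @ U.T                    # project onto span(u_1..u_k) and map back
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))
```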
applications
- visualization
- preprocessing to reduce the number of dimensions
- reduce noise in the data (by getting rid of the extra, low-variance axes)
some recent work
Candes, Li, Ma, Wright: “Robust PCA” — instead of directly minimizing \(\norm{X - L}\) such that \(\text{rank}\qty(L) \leq k\) (which is non-convex), decompose \(X = L + S\) and minimize the convex surrogate \(\norm{L}_{*} + \lambda \norm{S}_{1}\), where \(\norm{L}_{*}\) is the nuclear norm (the sum of the singular values of \(L\)) and \(\norm{S}_{1}\) is the entrywise \(\ell_{1}\) norm of the sparse part \(S\)
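As a rough sketch (not the authors' reference code), this convex program, minimize \(\norm{L}_{*} + \lambda \norm{S}_{1}\) subject to \(L + S = X\), can be attacked with an ADMM / inexact augmented Lagrangian loop; the parameter defaults below are common heuristics I am assuming, not values from these notes:

```python
import numpy as np

def robust_pca(X, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Sketch of principal component pursuit: X ~ L (low rank) + S (sparse)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))            # common default weight
    mu = mu if mu is not None else m * n / (4.0 * np.abs(X).sum() + 1e-12)  # heuristic penalty
    soft = lambda A, t: np.sign(A) * np.maximum(np.abs(A) - t, 0.0)       # soft-thresholding
    L = np.zeros(X.shape)
    S = np.zeros(X.shape)
    Y = np.zeros(X.shape)                                                 # dual variable
    for _ in range(max_iter):
        # L-step: singular value thresholding of X - S + Y/mu
        U, s, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * soft(s, 1.0 / mu)) @ Vt
        # S-step: entrywise soft-thresholding
        S = soft(X - L + Y / mu, lam / mu)
        # dual ascent on the constraint L + S = X
        Y = Y + mu * (X - L - S)
        if np.linalg.norm(X - L - S) <= tol * np.linalg.norm(X):
            break
    return L, S
```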
