Line Search and Steepest Descent
Gram-Schmidt For A-Orthogonality
You can use Gram-Schmidt to make a set of vectors A-orthogonal (conjugate). In particular, for a series of vectors \(s^{(j)}\) and a symmetric positive-definite matrix \(A\):
\begin{equation} s^{(q)} \leftarrow s^{(q)} - \sum_{q'=1}^{q-1} \frac{\langle s^{(q)}, s^{(q')} \rangle_{A}}{\langle s^{(q')}, s^{(q')} \rangle_{A}} s^{(q')} \end{equation}
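A minimal NumPy sketch of this conjugation step (the function name `a_orthogonalize` is mine; it uses the modified Gram-Schmidt form, which applies the same formula with the running \(s^{(q)}\)):

```python
import numpy as np

def a_orthogonalize(vectors, A):
    """Make `vectors` mutually A-orthogonal: <s_i, s_j>_A = s_i^T A s_j = 0 for i != j.

    Assumes A is symmetric positive-definite. Modified Gram-Schmidt form:
    each projection is subtracted from the running s, which is the same
    formula but numerically more stable.
    """
    S = []
    for v in vectors:
        s = np.asarray(v, dtype=float).copy()
        for sp in S:
            s -= ((s @ A @ sp) / (sp @ A @ sp)) * sp  # remove the A-component along sp
        S.append(s)
    return S
```

A quick sanity check: for any two returned directions, `S[i] @ A @ S[j]` should be roughly zero.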
For Conjugate Gradient, it works out such that only one such dot product is non-zero, so we can write:
\begin{equation} s^{(q)} = r^{(q)} + \frac{r^{(q)}\cdot r^{(q)}}{r^{(q-1)}\cdot r^{(q-1)}} s^{(q-1)} \end{equation}
for residual \(r^{(q)} = b - Ac^{(q)}\).
Conjugate Gradient
For \(Ac = b\), let’s write:
Start with \(s^{(0)} = r^{(0)} = b - Ac^{(0)}\)
Iteration:
- \(\alpha^{(q)} = \frac{r^{(q)}\cdot r^{(q)}}{\langle s^{(q)}, s^{(q)} \rangle_{A}}\)
- \(c^{(q+1)} = c^{(q)} + \alpha^{(q)} s^{(q)}\)
- \(r^{(q+1)} = r^{(q)} - \alpha^{(q)} As^{(q)}\)
- \(s^{(q+1)} = r^{(q+1)} + \frac{r^{(q+1)} \cdot r^{(q+1)}}{r^{(q)}\cdot r^{(q)}} s^{(q)}\) (note: the second term on the right is the only difference between this and gradient descent; because the search directions stay A-conjugate, in exact arithmetic CG converges in a number of steps equal to the number of distinct eigenvalues of \(A\); see the sketch below)
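Putting the whole iteration together, a minimal NumPy sketch (the function name and tolerance are my own choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A c = b for symmetric positive-definite A via CG."""
    c = np.zeros_like(b, dtype=float)
    r = b - A @ c                      # r^(0) = b - A c^(0)
    s = r.copy()                       # s^(0) = r^(0)
    rr = r @ r
    for _ in range(len(b)):
        As = A @ s
        alpha = rr / (s @ As)          # alpha = (r . r) / <s, s>_A
        c = c + alpha * s              # c^(q+1) = c^(q) + alpha s^(q)
        r = r - alpha * As             # r^(q+1) = r^(q) - alpha A s^(q)
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:      # converged
            break
        s = r + (rr_new / rr) * s      # s^(q+1) = r^(q+1) + beta s^(q)
        rr = rr_new
    return c

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))  # ~[0.0909, 0.6364], i.e. [1/11, 7/11]
```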
If your metric (the matrix \(A\)) is not symmetric positive-definite, none of this applies well, and you should use other methods (e.g., GMRES).
Local Approximations
Taylor expansion: Taylor Series
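For reference, the expansion of \(f\) about a point \(a\):
\begin{equation} f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^{n} \end{equation}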
well-resolved functions
- Regions of a function with less variation can be captured with lower sampling rates
- Regions of a function with more variation require higher sampling rates (see the sketch below)
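A minimal sketch of this idea, refining the grid wherever adjacent samples differ a lot (the threshold and function name are my own):

```python
import numpy as np

def refine_samples(f, xs, threshold=0.5):
    """One refinement pass: insert a midpoint wherever f varies quickly.

    Regions with little variation keep the coarse spacing; regions
    with large jumps between neighbors get extra samples.
    """
    out = []
    for x0, x1 in zip(xs[:-1], xs[1:]):
        out.append(x0)
        if abs(f(x1) - f(x0)) > threshold:   # high variation -> higher sampling rate
            out.append(0.5 * (x0 + x1))
    out.append(xs[-1])
    return np.array(out)

xs = np.linspace(0.0, 2.0, 9)
print(refine_samples(np.exp, xs))  # extra points appear where exp grows fastest
```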
worries
- sampling at a constant rate doesn't adapt to the function's well-resolvedness (amount of local variation)
- a single, non-piecewise sampling + Taylor expansion breaks down at discontinuities
piecewise analysis allows you to fix both of these problems
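For instance, here is a sketch of piecewise first-order Taylor models, one per piece, assuming \(f\) and its derivative are available pointwise:

```python
import numpy as np

def piecewise_taylor(f, df, breakpoints):
    """One first-order Taylor model per piece, expanded about each midpoint.

    A discontinuity placed at a breakpoint never corrupts the pieces
    on either side of it.
    """
    models = []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        a = 0.5 * (lo + hi)                 # expansion point for this piece
        fa, dfa = f(a), df(a)
        models.append((lo, hi, lambda x, a=a, fa=fa, dfa=dfa: fa + dfa * (x - a)))
    return models

# function with a jump at x = 0; the pieces split exactly there
f  = lambda x: np.sin(x) if x < 0 else np.sin(x) + 1.0
df = lambda x: np.cos(x)
models = piecewise_taylor(f, df, [-1.0, 0.0, 1.0])
lo, hi, m = models[1]          # the piece on [0, 1]
print(m(0.5))                  # approximates sin(0.5) + 1, unaffected by the jump
```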
splitting training dataset and averaging
- split the data into two pieces
- train 2 separate neural networks, one per piece (assuming different distributions exist in what you are trying to model)
- run inference on each piece separately (see the sketch below)
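A minimal sketch with scikit-learn's MLPRegressor standing in for the two networks (the split point and toy data are my own assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# toy data with two regimes: different behavior below and above x = 0
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 1))
y = np.where(X[:, 0] < 0, np.sin(4 * X[:, 0]), X[:, 0] ** 2)

mask = X[:, 0] < 0
net_lo = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X[mask], y[mask])
net_hi = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X[~mask], y[~mask])

def predict(x):
    # route each input to the network trained on its piece
    net = net_lo if x[0] < 0 else net_hi
    return net.predict(np.asarray(x, dtype=float).reshape(1, -1))[0]

print(predict([-0.5]), predict([0.5]))
```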
To interpolate: k-means cluster the inputs along some dimension (say, color), train a network for each cluster, and then average the results based on the distance to each cluster.
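One reading of that averaging is to blend each cluster network's prediction, weighted by the input's distance to each centroid; a sketch under that assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(600, 3))           # e.g. inputs described by color channels
y = X @ np.array([1.0, -2.0, 0.5])             # toy target

km = KMeans(n_clusters=3, n_init=10).fit(X)
nets = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
            .fit(X[km.labels_ == k], y[km.labels_ == k])
        for k in range(3)]

def predict_blended(x):
    x = np.asarray(x, dtype=float).reshape(1, -1)
    d = np.linalg.norm(km.cluster_centers_ - x, axis=1)  # distance to each centroid
    w = 1.0 / (d + 1e-9)                                 # nearer cluster -> larger weight
    w /= w.sum()
    preds = np.array([net.predict(x)[0] for net in nets])
    return float(w @ preds)                              # distance-weighted average

print(predict_blended([0.2, 0.8, 0.5]))
```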