Line Search and Steepest Descent
Gram-Schmidt For A-Orthogonality
You can use Gram-Schmidt to make a set of vectors A-orthogonal (conjugate). In particular, for a series of vectors \(s^{(j)}\) and a symmetric positive-definite matrix \(A\):
\begin{equation} s^{(q)} \leftarrow s^{(q)} - \sum_{q'=1}^{q-1} \frac{\langle s^{(q)}, s^{(q')} \rangle_{A}}{\langle s^{(q')}, s^{(q')} \rangle_{A}} s^{(q')} \end{equation}
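A minimal NumPy sketch of this conjugation step (the function name `a_orthogonalize` is mine; it uses the modified Gram-Schmidt form, which applies the same formula with the running \(s^{(q)}\)):

```python
import numpy as np

def a_orthogonalize(vectors, A):
    """Make `vectors` mutually A-orthogonal: <s_i, s_j>_A = s_i^T A s_j = 0 for i != j.

    Assumes A is symmetric positive-definite. Modified Gram-Schmidt form:
    each projection is subtracted from the running s, which is the same
    formula but numerically more stable.
    """
    S = []
    for v in vectors:
        s = np.asarray(v, dtype=float).copy()
        for sp in S:
            s -= ((s @ A @ sp) / (sp @ A @ sp)) * sp  # remove the A-component along sp
        S.append(s)
    return S
```

A quick sanity check: for any two returned directions, `S[i] @ A @ S[j]` should be roughly zero.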
For Conjugate Gradient, it works out such that only one such dot product is non-zero, so we can write:
\begin{equation} s^{(q)} = r^{(q)} + \frac{r^{(q)}\cdot r^{(q)}}{r^{(q-1)}\cdot r^{(q-1)}} s^{(q-1)} \end{equation}
for residual \(r^{(q)} = b - Ac^{(q)}\).
Conjugate Gradient
For \(Ac = b\), let’s write:
Start with \(s^{(0)} = r^{(0)} = b - Ac^{(0)}\)
Iteration:
- \(\alpha^{(q)} = \frac{r^{(q)}\cdot r^{(q)}}{\langle s^{(q)}, s^{(q)} \rangle_{A}}\)
- \(c^{(q+1)} = c^{(q)} + \alpha^{(q)} s^{(q)}\)
- \(r^{(q+1)} = r^{(q)} - \alpha^{(q)} As^{(q)}\)
- \(s^{(q+1)} = r^{(q+1)} + \frac{r^{(q+1)} \cdot r^{(q+1)}}{r^{(q)}\cdot r^{(q)}} s^{(q)}\) (note: the second term on the right is the only difference between this and gradient descent; because the search directions stay A-conjugate, in exact arithmetic CG converges in a number of steps equal to the number of distinct eigenvalues of \(A\); see the sketch below)
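Putting the whole iteration together, a minimal NumPy sketch (the function name and tolerance are my own choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A c = b for symmetric positive-definite A via CG."""
    c = np.zeros_like(b, dtype=float)
    r = b - A @ c                      # r^(0) = b - A c^(0)
    s = r.copy()                       # s^(0) = r^(0)
    rr = r @ r
    for _ in range(len(b)):
        As = A @ s
        alpha = rr / (s @ As)          # alpha = (r . r) / <s, s>_A
        c = c + alpha * s              # c^(q+1) = c^(q) + alpha s^(q)
        r = r - alpha * As             # r^(q+1) = r^(q) - alpha A s^(q)
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:      # converged
            break
        s = r + (rr_new / rr) * s      # s^(q+1) = r^(q+1) + beta s^(q)
        rr = rr_new
    return c

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))  # ~[0.0909, 0.6364], i.e. [1/11, 7/11]
```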
If your metric (the matrix \(A\)) is not symmetric positive-definite, none of this applies well, and you should use other methods (e.g., GMRES).
Local Approximations
Taylor expansion: Taylor Series
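For reference, the expansion of \(f\) about a point \(a\):
\begin{equation} f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^{n} \end{equation}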
well-resolved functions
- Regions of a function with less variation can be captured with lower sampling rates
- Regions of a function with more variation require higher sampling rates (see the sketch below)
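A minimal sketch of this idea, refining the grid wherever adjacent samples differ a lot (the threshold and function name are my own):

```python
import numpy as np

def refine_samples(f, xs, threshold=0.5):
    """One refinement pass: insert a midpoint wherever f varies quickly.

    Regions with little variation keep the coarse spacing; regions
    with large jumps between neighbors get extra samples.
    """
    out = []
    for x0, x1 in zip(xs[:-1], xs[1:]):
        out.append(x0)
        if abs(f(x1) - f(x0)) > threshold:   # high variation -> higher sampling rate
            out.append(0.5 * (x0 + x1))
    out.append(xs[-1])
    return np.array(out)

xs = np.linspace(0.0, 2.0, 9)
print(refine_samples(np.exp, xs))  # extra points appear where exp grows fastest
```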
worries
- sampling at a constant rate doesn't adapt to the function's well-resolvedness (amount of local variation)
- a single, non-piecewise sampling + Taylor expansion breaks down at discontinuities
piecewise analysis allows you to fix both of these problems
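For instance, here is a sketch of piecewise first-order Taylor models, one per piece, assuming \(f\) and its derivative are available pointwise:

```python
import numpy as np

def piecewise_taylor(f, df, breakpoints):
    """One first-order Taylor model per piece, expanded about each midpoint.

    A discontinuity placed at a breakpoint never corrupts the pieces
    on either side of it.
    """
    models = []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        a = 0.5 * (lo + hi)                 # expansion point for this piece
        fa, dfa = f(a), df(a)
        models.append((lo, hi, lambda x, a=a, fa=fa, dfa=dfa: fa + dfa * (x - a)))
    return models

# function with a jump at x = 0; the pieces split exactly there
f  = lambda x: np.sin(x) if x < 0 else np.sin(x) + 1.0
df = lambda x: np.cos(x)
models = piecewise_taylor(f, df, [-1.0, 0.0, 1.0])
lo, hi, m = models[1]          # the piece on [0, 1]
print(m(0.5))                  # approximates sin(0.5) + 1, unaffected by the jump
```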
splitting training dataset and averaging
- split the data into two pieces
- train 2 separate neural networks, one per piece (assuming different distributions exist in what you are trying to model)
- run inference on each piece separately (see the sketch below)
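A minimal sketch with scikit-learn's MLPRegressor standing in for the two networks (the split point and toy data are my own assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# toy data with two regimes: different behavior below and above x = 0
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 1))
y = np.where(X[:, 0] < 0, np.sin(4 * X[:, 0]), X[:, 0] ** 2)

mask = X[:, 0] < 0
net_lo = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X[mask], y[mask])
net_hi = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X[~mask], y[~mask])

def predict(x):
    # route each input to the network trained on its piece
    net = net_lo if x[0] < 0 else net_hi
    return net.predict(np.asarray(x, dtype=float).reshape(1, -1))[0]

print(predict([-0.5]), predict([0.5]))
```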
To interpolate: k-means cluster the inputs along some dimension (say, color), train a network for each cluster, and then average the results based on the distance to each cluster.
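One reading of that averaging is to blend each cluster network's prediction, weighted by the input's distance to each centroid; a sketch under that assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(600, 3))           # e.g. inputs described by color channels
y = X @ np.array([1.0, -2.0, 0.5])             # toy target

km = KMeans(n_clusters=3, n_init=10).fit(X)
nets = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
            .fit(X[km.labels_ == k], y[km.labels_ == k])
        for k in range(3)]

def predict_blended(x):
    x = np.asarray(x, dtype=float).reshape(1, -1)
    d = np.linalg.norm(km.cluster_centers_ - x, axis=1)  # distance to each centroid
    w = 1.0 / (d + 1e-9)                                 # nearer cluster -> larger weight
    w /= w.sum()
    preds = np.array([net.predict(x)[0] for net in nets])
    return float(w @ preds)                              # distance-weighted average

print(predict_blended([0.2, 0.8, 0.5]))
```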