Descent methods are generally of the form:
\begin{align} x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)} \end{align}
where the step size \(t^{(k)}\) and direction \(\Delta x^{(k)}\) are determined by the particular descent method. The Hessian defines an ellipsoid that approximates the shape of the sublevel set near \(x\):
\begin{align} \qty {x + v \mid v^{T} \nabla^{2} f\qty(x) v \leq 1} \end{align}
line search
backtracking line search
A first-order Taylor approximation of \(f\qty(x + t \Delta x)\) in \(t\), centered about \(x\):
\begin{align} f\qty(x) + t \nabla f\qty(x)^{T} \Delta x \end{align}
Since \(\nabla f\qty(x)^{T} \Delta x < 0\) for a descent direction, scaling the slope by \(\alpha \in \qty(0, \frac{1}{2})\) raises this line slightly:
\begin{align} f\qty(x) + \alpha t \nabla f\qty(x)^{T} \Delta x \end{align}
Starting from \(t = 1\) and shrinking by a factor \(\beta \in \qty(0, 1)\), we stop at the first \(t\) for which
\begin{align} f\qty(x + t\Delta x) \leq f\qty(x) + \alpha t \nabla f\qty(x)^{T} \Delta x \end{align}
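A minimal sketch of backtracking under this condition (the function handles `f` and `grad_f`, the direction `dx`, and the default parameter values are assumptions of this sketch, not fixed by the notes):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) falls below the relaxed first-order line.

    alpha in (0, 1/2) and beta in (0, 1) are the usual parameter ranges;
    dx is assumed to be a descent direction (grad_f(x) @ dx < 0).
    """
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * grad_f(x) @ dx:
        t *= beta
    return t
```

For example, on \(f\qty(x) = x^{T}x\) with \(\Delta x = -\nabla f\qty(x)\), the loop shrinks \(t\) until the accepted step strictly decreases \(f\).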
gradient descent
Set:
\begin{align} \Delta x = -\nabla f\qty(x) \end{align}
convergence: for strongly convex \(f\):
\begin{align} f\qty(x^{(k)}) - p^{*} \leq c^{k} \qty(f\qty(x^{(0)}) - p^{*}) \end{align}
for some constant \(c \in \qty(0, 1)\) that depends on the strong-convexity parameters.
Gradient methods can be slow, and in particular exhibit “zigzagging” whenever the sublevel sets are not spherical (anisotropic); gradient descent performs best when the sublevel sets are nearly isotropic. This can be measured by the condition number of the Hessian.
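As a sketch of this effect, running fixed-step gradient descent on the quadratic \(f\qty(x) = \frac{1}{2} x^{T} \operatorname{diag}\qty(1, \kappa) x\), whose Hessian has condition number \(\kappa\), shows the iteration count growing with \(\kappa\) (the step-size choice \(t = \frac{2}{1 + \kappa}\), which is optimal for this particular quadratic, is an assumption here):

```python
import numpy as np

def gradient_descent(grad_f, x0, t, tol=1e-8, max_iter=10000):
    """Iterate x <- x - t * grad_f(x) with a fixed step size t
    (a simplification of the line-search version above)."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - t * g
    return x, k

def iterations_for(kappa):
    # f(x) = (1/2) x^T diag(1, kappa) x, so grad f(x) = diag(1, kappa) x;
    # larger kappa means more elongated sublevel sets and slower progress.
    Q = np.diag([1.0, kappa])
    t = 2.0 / (1.0 + kappa)  # optimal fixed step for this quadratic
    _, k = gradient_descent(lambda x: Q @ x, np.array([1.0, 1.0]), t)
    return k
```

With \(\kappa = 1\) (spherical sublevel sets) convergence is immediate; increasing \(\kappa\) steadily increases the number of iterations.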
steepest descent
Gradient methods can be very slow. Steepest descent instead picks the direction of greatest first-order decrease over a unit ball in some norm:
\begin{align} \Delta x = \arg\min \qty {\nabla f\qty(x)^{T} v \mid \norm{v} \leq 1} \end{align}
Some choices of norms:
- Euclidean norm: \(-\nabla f\qty(x)\) (it’s just gradient descent)
- quadratic norm: for \(\norm{x}_{P} = \qty(x^{T} P x)^{\frac{1}{2}}\) with \(P \succ 0\), we get \(-P^{-1}\nabla f\qty(x)\) (you can imagine this as a change of variables \(\bar{x} = P^{\frac{1}{2}}x\)).
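A minimal sketch of the quadratic-norm case (the function name is hypothetical, and a symmetric positive definite \(P\) is assumed; solving \(P \Delta x = -\nabla f\qty(x)\) avoids forming \(P^{-1}\) explicitly):

```python
import numpy as np

def quadratic_norm_direction(grad, P):
    """Steepest descent direction -P^{-1} grad under the norm
    ||v||_P = (v^T P v)^{1/2}; P must be symmetric positive definite.
    Solving the linear system is cheaper and more stable than inverting P."""
    return np.linalg.solve(P, -grad)
```

Taking \(P = \nabla^{2} f\qty(x)\) recovers the Newton step, which motivates the next section.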
Newton’s Method
See Newton’s Method
