Houjun Liu

SU-CS361 APR112024

Hyper-gradient Descent

Adapt the execution of gradient descent to Hyper-gradient Descent! Recall the Descent Direction Iteration update rule:

For LR \(\alpha\), what if we write:

\begin{equation} \pdv{f\qty(x^{(k+1)})}{\alpha} = \pdv{f(x^{(k+1)}}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \end{equation}

The left side is just \(f’(x^{(k+1)}) = \nabla_{x}f(x^{(k+1)})\). Recall that the right side is \(\pdv{\alpha} \qty(x^{(k)} - \alpha \nabla_{x} f(x^{(k)}))\). This evaluates to simply \(-\nabla_{x}f(x^{(k)})\).

Therefore:

\begin{align} \pdv{f\qty(x^{(k+1)})}{\alpha} &= \pdv{f(x^{(k+1)}}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \\ &= \nabla_{x}f(x^{(k+1)}) \cdot (-\nabla_{x}f(x^{(k)})) \end{align}

And so, to update our step size/learning rate, we can:

\begin{align} \alpha^{(k+1)} &= \alpha^{(k)} - \mu \qty(\pdv{f\qty(x^{(k+1)})}{\alpha}) \\ &= \alpha^{(k)} - \mu \qty[\nabla_{x}f(x^{(k+1)}) \cdot (-\nabla_{x}f(x^{(k)}))] \\ &= \alpha^{(k)} + \mu \qty[\nabla_{x}f(x^{(k+1)}) \cdot (\nabla_{x}f(x^{(k)}))] \end{align}

Therefore, we update our weights based on steps of \(\alpha\), and update our learning rate too with respect to minimizing \(f\). You will note that optimal step sizes results in gradients being orthogonal; we see that this reflects this—no updates to \(\alpha\) happen if the gradients from \((k+1)\) is orthogonal to \(k\).

Second-Order Methods

Newton’s Method

See Newton’s Method

Secant Method

You can estimate the Hessian from the gradient to apply Newton’s Method; this requires doing a thing:

\begin{equation} f’’(x_{k}) \approx \frac{f’(x_{k}) - f’(x_{k-1})}{x_{k} - x_{k-1}} \end{equation}

Now, we can write:

\begin{equation} x_{t+1} = x_{t} - \frac{x_{t} - x_{t-1}}{f’(x_{t}) -f’(x_{t-1})} f’(x_{t}) \end{equation}

How do we do this for Hessian? Use one of—

due to approximate nature, this may take more steps to converge.

Direct Methods

cycle through the coordinates and do line search. For a function that has \(n\) design variables in the design point. In each iteration, we freeze all dimensions except one, do line search, then move on to the next coordinate frame, and repeat.

Broken

This may fail when there is a trough between two axes

After taking \(n\) cycles through all the directions once, also take a step in the average of the directions of the previous \(n\) steps. This kinda fixes the case where there’s a trough in between.

Powell’s Method

This is Accelerated Coordinate Search, but we forget the oldest search direction. Consider Cyclic Coordinate Search, which searches in steps of the basis vectors \(e^{1}, e^{2}, …, e_{n}\). Accelerated Coordinate Search searches in \(e^{1} … e^{n}, d\), where \(d\) is the averages of your previous steps.

Powell’s Method go once forward by dropping the first member of the list

  • \(e^{1} … e^{n}, d^{(1)}\)
  • \(e^{2}, …, e^{n}, d^{(1)}, d^{(2)}\)
  • \(e^{3}, …, e^{n}, d^{(1)}, d^{(2)}, d^{(3)}\)

Failure

This may rapidly result in linearly dependent variables due to so much averaging. We therefore perform a linear search to obtain a new \(d\) or just reset this list back to independent bases every so often.

See Hooke-Jeeves Policy Search, but you don’t check the Rollout Policy; you just go and evaluate the function.

big picture: check a set of local perturbations, move your center point to the lowest one, and perturb again. If you converge, shrink your perturbation size.

Hooke-Jeeves Search uses \(2n\) searches, one in each direction; for \(\mathbb{R}^{2}\), for instance, this requires \(4\) lookups because 2 basis, and one sample in positive and one sample in negative direction.

So, to save one lookup, we create a positive spanning set instead (i.e. spanning set for the space for which you are constrained to only use positive coefficients, meaning we need one more vector). This creates a “triangle” which requires \(n+1\) vectors, but we only check one point per spanning set member.

If you get a minimum point, just go there.

Dynamic Ordering

If you found a minimum point once, check that point again first after going there.

Nelder-Mead Simplex Method

Take a simplex to search. Sort its verticies by lowest to highest in terms of objective function value, with lowest point being called \(x_{l}\), the \(x_{h}\) the highest function value, \(x_{s}\) being the second highest function value. \(\bar{x}\) being the mean point between all points except \(x_{h}\).

You have four possible actions

  • reflection: take your worst point, and reflex across the mean line formed across the remaining points
  • expansion: take your best point, and move them
  • contraction: move the worst point closer to the mean of the other two
  • shrink: move all points closer together