SU-CS361 APR042024
Last edited: August 8, 2025optimization inequalities cannot be strict
Consider:
\begin{align} \min_{x}&\ x \\ s.t.\ & x > 1 \end{align}
this has NO SOLUTION. (1,1) wouldn’t actually be in feasible set. So, we usually specify optimization without a strict inequality.
So, instead, we write:
\begin{align} \min_{x}&\ x \\ s.t.\ & x \geq 1 \end{align}
Univariate Conditions
First order Necessary Condition
\begin{equation} \nabla f(x^{*}) = 0 \end{equation}
Second order necessary condition
\begin{equation} \nabla^{2}f(x^{*}) \geq 0 \end{equation}
SU-CS361 APR092024
Last edited: August 8, 2025More Bracketing Methods
Quadratic search
if your function is unimodal…
- Pick three points that gets “high, low, high”
- Fit a quadratic to it, evaluate its minima and add it to the point set
- Now, drop any of the four resulting point
Shubert-Piyavskill Method
This is a Bracketing approach which grantees optimality WITHOUT unimodality by using the Lipschitz Constant. But, this only works in one dimension.
Consider a Lipschitz continuous function with Lipschitz Constant \(L\). We can get our two initial points \(a\) and \(b\). First, we arbitrarily pick a point in the middle to evaluate; this will give us a cone (see Lipschitz Condition) which bounds the function.
SU-CS361 APR112024
Last edited: August 8, 2025Hyper-gradient Descent
Adapt the execution of gradient descent to Hyper-gradient Descent! Recall the Descent Direction Iteration update rule:
For LR \(\alpha\), what if we write:
\begin{equation} \pdv{f\qty(x^{(k+1)})}{\alpha} = \pdv{f(x^{(k+1)}}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \end{equation}
The left side is just \(f’(x^{(k+1)}) = \nabla_{x}f(x^{(k+1)})\). Recall that the right side is \(\pdv{\alpha} \qty(x^{(k)} - \alpha \nabla_{x} f(x^{(k)}))\). This evaluates to simply \(-\nabla_{x}f(x^{(k)})\).
Therefore:
\begin{align} \pdv{f\qty(x^{(k+1)})}{\alpha} &= \pdv{f(x^{(k+1)}}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \\ &= \nabla_{x}f(x^{(k+1)}) \cdot (-\nabla_{x}f(x^{(k)})) \end{align}
And so, to update our step size/learning rate, we can:
SU-CS361 APR162024
Last edited: August 8, 2025Stochastic Methods
Stochastic Methods is where you use randomness strategically to escape local minima. This typically rely on pseudo-random generation.
Noisy Descent
Gradient descent but slightly bad:
\begin{equation} \bold{x} = \bold{x} + \alpha \nabla f_{\bold{x}} + \epsilon \end{equation}
where:
\begin{equation} \epsilon \sim \mathcal{N}(0, \lambda I) \end{equation}
Mesh Adaptive Direct Search
This is like Generalized Pattern Search, but instead of a fixed positive spanning set we change the directions of the span vectors every single step; once randomized, then we expand/shrink as needed.
