Houjun Liu

SU-CS361 APR162024

Stochastic Methods

Stochastic Methods is where you use randomness strategically to escape local minima. This typically rely on pseudo-random generation.

Noisy Descent

Gradient descent but slightly bad:

\begin{equation} \bold{x} = \bold{x} + \alpha \nabla f_{\bold{x}} + \epsilon \end{equation}


\begin{equation} \epsilon \sim \mathcal{N}(0, \lambda I) \end{equation}

This is like Generalized Pattern Search, but instead of a fixed positive spanning set we change the directions of the span vectors every single step; once randomized, then we expand/shrink as needed.

simulated annealing

see simulated annealing

Cross Entropy Method

We can also do Cross Entropy Method: choose an initial distribution and sample; refit distribution on the \(k\) best points, start again with sampling.

Natural Evolution

Natural Evolution is Cross Entropy Method but more nuanced. Consider a design point sampling distribution parameterized by \(\theta\), our objective:

\begin{equation} \min_{x \sim p(\cdot | \theta)} \mathbb{E} [f(x)] \end{equation}

(“minimize the expectation of the objective function over sampled points from that sampling distribution”)

To perform this minimization, we use gradient descent. Writing this out:

\begin{align} \nabla_{\theta}\mathbb{E} [f(x)] &=\nabla_{\theta} \int p(x|\theta) f(x) \dd{x} \\ &= \int \frac{p(x|\theta)}{p(x|theta)} \nabla_{\theta} p(x|\theta) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\nabla_{\theta} \log p(x|\theta)) f(x) \dd{x} \\ &= \mathbb{E}_{x \sim p(\cdot | \theta)} \qty[f(x) \nabla_{\theta}\log p(x|\theta)] \\ &\approx \frac{1}{m} \sum_{i}^{} f(x) \nabla_{\theta} \log p(x|\theta) \end{align}

where we used the “log derivative trick” (i.e. the chain rule with the log)

Notice: this works for non differentiable \(f\)!!!!!

And then we just do gradient descent with this instead of refitting points like in Cross Entropy Method.

Population Method

Population Methods optimize, instead of a single design point, a cohort of design points.

Typically, we use this by first using Population Methods to get a generally good point, then, on each step, we fine tune this via iterated descent; two main idea:

Generic Algorithms

  • every single design is represented by a sequence of chromosomes (it sometimes use a binary representation, but real valued vectors work well as well)
  • select the fittest individuals
    • truncation selection: cutting off most of the lowest performers
    • tournament selection: select fittest out of \(k\) randomly chosen individuals
    • roulette wheel selection: sample individuals with probability proportional to their fitness
  • children forming (from fittest)
    • single-point crossover: randomly select a point between the chromosome sequence, then generate a child design by taking the chunk before and after that point from each parent
    • two-point crossover: the above, but only cross over a single chunk
    • uniform crossover: choose elements from parents with 50% probability
  • population mutation
    • randomly flip each bit/reinitialize some values

Particle Swarm Optimization

for each individual in the papulation, we keep track of

  • current position
  • current velocity
  • best position so far by the current particle
  • best position seen by all particles

\begin{equation} x^{(i)} = x^{(i)} + v^{(i)} \end{equation}

\begin{equation} v^{(i)} = w v^{(i)} + c_1 r_1 \qty(x^{(i)}_{\text{your best}} - x^{(i)}) + c_2 r_2 \qty(x_{\text{global best}} - x^{(i)}) \end{equation}

basically, this is momentum where we try to go both towards our local optima that we’ve seen and also in a little towards the direction that the entire swarm has seen

Multi-Objective Population Method

See Multi-Objective Population Method