## Stochastic Methods

Stochastic Methods use randomness **strategically** to escape local minima. They typically rely on **pseudo-random generation**.

### Noisy Descent

Gradient descent, but deliberately a little noisy: at each step we add Gaussian noise to the update:

\begin{equation} \bold{x} = \bold{x} - \alpha \nabla f(\bold{x}) + \epsilon \end{equation}

where:

\begin{equation} \epsilon \sim \mathcal{N}(0, \lambda I) \end{equation}
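A minimal NumPy sketch of the update above (the step size, noise scale, and the decaying-noise schedule are illustrative assumptions; the equation above keeps \(\lambda\) fixed):

```python
import numpy as np

def noisy_descent(grad_f, x0, alpha=0.1, lam=0.1, steps=100, seed=0):
    """Gradient descent plus Gaussian noise, to help escape local minima."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(steps):
        # decay the noise scale over time so the iterates eventually settle
        # (the equation above uses a fixed lambda; decay is a common refinement)
        sigma = np.sqrt(lam / (k + 1))
        eps = rng.normal(0.0, sigma, size=x.shape)
        x = x - alpha * grad_f(x) + eps
    return x
```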

### Mesh Adaptive Direct Search

This is like Generalized Pattern Search, but instead of a fixed positive spanning set, we randomize the directions of the spanning vectors at every single step; after randomizing, we expand or shrink the step size as needed.
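A simplified sketch of the randomize-then-adapt idea (note: this is just a randomized pattern search, not full MADS — the real algorithm constrains candidate points to a mesh and builds the spanning set from a lower-triangular random matrix):

```python
import numpy as np

def randomized_pattern_search(f, x0, step=1.0, steps=200, seed=0):
    """Each iteration: draw a fresh positive spanning set, poll it,
    expand the step on success, shrink it on failure."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = f(x)
    n = x.size
    for _ in range(steps):
        # n random vectors plus the negation of their sum:
        # a minimal positive spanning set (with probability 1)
        D = rng.normal(size=(n, n))
        dirs = np.vstack([D, -D.sum(axis=0, keepdims=True)])
        improved = False
        for d in dirs:
            d = d / np.linalg.norm(d)
            cand = x + step * d
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                improved = True
                break
        step = step * 2.0 if improved else step * 0.5
    return x, fx
```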

### Simulated Annealing

Propose a random perturbation of the current point at each step; always accept improvements, and accept worse points with a probability that shrinks as a "temperature" parameter cools over time. Early on we explore freely; as the temperature drops, we settle into a minimum.
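A minimal sketch (the Gaussian proposal, the Metropolis acceptance rule \(\exp(-\Delta f / t)\), and the \(1/k\) cooling schedule are common choices, assumed here):

```python
import math
import random

def simulated_annealing(f, x0, t0=1.0, steps=1000, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        t = t0 / k                       # cooling schedule (assumed: 1/k decay)
        cand = x + rng.gauss(0.0, 1.0)   # Gaussian transition distribution
        fc = f(cand)
        # Metropolis criterion: always accept improvements,
        # accept worse points with probability exp(-(fc - fx) / t)
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest
```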

### Cross Entropy Method

We can also use the Cross Entropy Method: choose an initial sampling distribution and draw samples; refit the distribution to the \(k\) best points, then sample again, repeating until convergence.
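A minimal sketch with a Gaussian sampling distribution (the sample count, elite count, and the small diagonal jitter on the refit covariance are illustrative assumptions):

```python
import numpy as np

def cross_entropy_method(f, mean, cov, k_elite=10, n_samples=100, iters=20, seed=0):
    """Sample from a Gaussian, refit (mean, cov) to the k best samples, repeat."""
    rng = np.random.default_rng(seed)
    mean = np.array(mean, dtype=float)
    cov = np.array(cov, dtype=float)
    for _ in range(iters):
        xs = rng.multivariate_normal(mean, cov, size=n_samples)
        elite = xs[np.argsort([f(x) for x in xs])[:k_elite]]
        mean = elite.mean(axis=0)
        # jitter keeps the refit covariance positive definite
        cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))
    return mean
```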

## Natural Evolution

Natural Evolution is the Cross Entropy Method but more nuanced. Consider a design-point sampling distribution parameterized by \(\theta\); our objective is:

\begin{equation} \min_{\theta} \mathbb{E}_{x \sim p(\cdot | \theta)} [f(x)] \end{equation}

("minimize, over \(\theta\), the expectation of the objective function over points sampled from that sampling distribution")

To perform this minimization, we use gradient descent on \(\theta\). Writing out the gradient:

\begin{align} \nabla_{\theta}\mathbb{E} [f(x)] &=\nabla_{\theta} \int p(x|\theta) f(x) \dd{x} \\ &= \int \frac{p(x|\theta)}{p(x|\theta)} \nabla_{\theta} p(x|\theta) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\nabla_{\theta} \log p(x|\theta)) f(x) \dd{x} \\ &= \mathbb{E}_{x \sim p(\cdot | \theta)} \qty[f(x) \nabla_{\theta}\log p(x|\theta)] \\ &\approx \frac{1}{m} \sum_{i=1}^{m} f\qty(x^{(i)}) \nabla_{\theta} \log p\qty(x^{(i)}|\theta) \end{align}

where we used the "log derivative trick" (i.e. the chain rule applied to the log: \(\nabla_{\theta} \log p = \frac{\nabla_{\theta} p}{p}\)), and the last line is a Monte Carlo estimate from \(m\) samples \(x^{(i)} \sim p(\cdot|\theta)\).

Notice: this works even for **non-differentiable \(f\)**, since we only ever differentiate \(\log p(x|\theta)\)!

And then we just do gradient descent on \(\theta\) with this estimate, instead of refitting to elite points as in the Cross Entropy Method.
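As a minimal concrete sketch, take \(p(\cdot|\theta)\) to be a Gaussian with fixed \(\sigma\) where \(\theta\) is just the mean \(\mu\), so \(\nabla_{\mu} \log p(x|\mu) = (x - \mu)/\sigma^2\) (the baseline subtraction and all hyperparameters here are illustrative assumptions):

```python
import numpy as np

def nes_gaussian_mean(f, mu, sigma=0.5, alpha=0.05, m=100, iters=200, seed=0):
    """Search-gradient descent on the mean of a fixed-sigma Gaussian."""
    rng = np.random.default_rng(seed)
    mu = np.array(mu, dtype=float)
    for _ in range(iters):
        xs = mu + sigma * rng.normal(size=(m, mu.size))
        fs = np.array([f(x) for x in xs])
        fs = fs - fs.mean()  # baseline subtraction to reduce variance (common trick)
        # (1/m) sum_i f(x_i) * grad_mu log p(x_i | mu)
        grad = (fs[:, None] * (xs - mu) / sigma**2).mean(axis=0)
        mu = mu - alpha * grad  # descend: we are minimizing E[f]
    return mu
```

Note that `f` here can be non-differentiable (e.g. an absolute-value objective) — only \(\log p\) is differentiated.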

## Population Methods

Population Methods optimize a cohort of design points instead of a single design point.

Typically, we first use a Population Method to find a generally good point, then fine-tune it with a descent method; there are two main ideas for combining the two:

- Lamarckian Learning: run the Population Method, then perform regular descent on each individual
- Baldwinian Learning: run the descent method, but only use the descended values to weight each individual's fitness (this preserves diversity)

### Genetic Algorithms

- every single design is represented by a sequence of **chromosomes** (sometimes a binary representation is used, but real-valued vectors work well too)
- selecting the fittest individuals:
  - **truncation selection**: cut off most of the lowest performers
  - **tournament selection**: select the fittest out of \(k\) randomly chosen individuals
  - **roulette wheel selection**: sample individuals with probability proportional to their fitness
- forming children (from the fittest):
  - **single-point crossover**: randomly select a point in the **chromosome** sequence, then generate a child by taking the chunk before that point from one parent and the chunk after it from the other
  - **two-point crossover**: the above, but with two points, so only the chunk between them is crossed over
  - **uniform crossover**: choose each element from either parent with 50% probability
- population mutation:
  - randomly flip each bit / reinitialize some values with small probability
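A minimal real-valued GA combining these pieces (tournament selection, uniform crossover, Gaussian mutation; the population size, mutation rate, and bounds are illustrative assumptions):

```python
import numpy as np

def genetic_algorithm(f, n_pop=40, n_dim=2, k_tourney=3, p_mut=0.1, gens=60, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(n_pop, n_dim))
    for _ in range(gens):
        fit = np.array([f(x) for x in pop])

        def select():
            # tournament selection: fittest of k randomly chosen individuals
            idx = rng.integers(n_pop, size=k_tourney)
            return pop[idx[np.argmin(fit[idx])]]

        children = []
        for _ in range(n_pop):
            a, b = select(), select()
            mask = rng.random(n_dim) < 0.5           # uniform crossover
            child = np.where(mask, a, b)
            mut = rng.random(n_dim) < p_mut          # Gaussian mutation
            child = child + mut * rng.normal(0, 0.5, n_dim)
            children.append(child)
        pop = np.array(children)
    fit = np.array([f(x) for x in pop])
    return pop[np.argmin(fit)]
```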

### Particle Swarm Optimization

For each individual in the population, we keep track of:

- current position
- current velocity
- best position so far by the current particle
- best position seen by all particles

\begin{equation} x^{(i)} = x^{(i)} + v^{(i)} \end{equation}

\begin{equation} v^{(i)} = w v^{(i)} + c_1 r_1 \qty(x^{(i)}_{\text{your best}} - x^{(i)}) + c_2 r_2 \qty(x_{\text{global best}} - x^{(i)}) \end{equation}

where \(w\) is the inertia weight, \(c_1, c_2\) are acceleration coefficients, and \(r_1, r_2 \sim U(0, 1)\) are fresh random draws each step.

Basically, this is momentum-style descent where each particle is pulled both towards the best position it has seen itself and towards the best position the entire swarm has seen.
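The two update equations above, sketched in NumPy (the initialization bounds and hyperparameters \(w = 0.7\), \(c_1 = c_2 = 1.5\) are illustrative assumptions):

```python
import numpy as np

def particle_swarm(f, n_particles=30, n_dim=2, w=0.7, c1=1.5, c2=1.5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n_particles, n_dim))  # current positions
    v = np.zeros_like(x)                               # current velocities
    p_best = x.copy()                                  # best position per particle
    p_val = np.array([f(xi) for xi in x])
    g_best = p_best[np.argmin(p_val)]                  # best position seen by all
    for _ in range(iters):
        r1 = rng.random((n_particles, n_dim))
        r2 = rng.random((n_particles, n_dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_val
        p_best[improved] = x[improved]
        p_val[improved] = vals[improved]
        g_best = p_best[np.argmin(p_val)]
    return g_best
```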