## Diffusion Models

We can consider a model that maps between random noise and images (say, pictures of trees).

At every step, we sample Gaussian noise and **add** it to the image. The original approach adds Gaussian noise to the pixels; nowadays people instead replace the pixel.

Usually, there are on the order of a thousand steps of noising.
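The noising loop can be sketched as follows. This is a minimal illustration, not the exact procedure from any particular paper: the linear variance schedule, its endpoint values, the 1000-step count, and the names `linear_beta_schedule`, `noise_step`, and `forward_diffuse` are all assumptions made for the example. Each step slightly shrinks the signal before adding noise (a common variance-preserving choice), so the result ends up close to standard Gaussian noise.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise-variance schedule (endpoint values are assumed)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def noise_step(pixels, beta, rng):
    """One forward step: shrink the signal slightly, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]


def forward_diffuse(pixels, betas, rng):
    """Apply every noising step in order and return the final noisy image."""
    for beta in betas:
        pixels = noise_step(pixels, beta, rng)
    return pixels


rng = random.Random(0)
image = [1.0] * 1000                 # toy "image": a flat list of pixel values
betas = linear_beta_schedule(1000)   # assumed step count
noisy = forward_diffuse(image, betas, rng)

# After many small steps the pixels should look like standard Gaussian noise.
mean = sum(noisy) / len(noisy)
var = sum((p - mean) ** 2 for p in noisy) / len(noisy)
print(f"mean={mean:.3f} var={var:.3f}")
```

The printed mean and variance should land near 0 and 1 respectively, which is the sense in which the image has been turned into random noise.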

Why can’t we have a one-step policy from noise to pictures? Because of a result from physics: reversing diffusion is only tractable when each step is small, and with steps that are too large the reverse process becomes unstable and intractable.

### Loss Function

One way we can model our objective is as maximum likelihood estimation (MLE). Because we are continuously adding Gaussian noise, we can assume that

\begin{equation} y \sim \mathcal{N}(\mu = \hat{y}(\theta), \sigma^{2}=k) \end{equation}

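If you compute the MLE over the choice of \(\hat{y}(\theta)\), you get the squared error: since \(k\) is fixed, maximizing the Gaussian likelihood is the same as minimizing \((y - \hat{y}(\theta))^{2}\).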
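A short derivation makes this concrete. It assumes \(n\) independent observations \(y_i\) with predictions \(\hat{y}_i(\theta)\); the index \(i\) and the count \(n\) are notation introduced only for this sketch.

\begin{equation} \log p(y \mid \theta) = \sum_{i=1}^{n} \log \mathcal{N}(y_i; \hat{y}_i(\theta), k) = -\frac{n}{2} \log(2\pi k) - \frac{1}{2k} \sum_{i=1}^{n} (y_i - \hat{y}_i(\theta))^{2} \end{equation}

The first term and the factor \(\frac{1}{2k}\) do not depend on \(\theta\), so

\begin{equation} \arg\max_{\theta} \log p(y \mid \theta) = \arg\min_{\theta} \sum_{i=1}^{n} (y_i - \hat{y}_i(\theta))^{2} \end{equation}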

### ELBO

The ELBO (evidence lower bound) is a cool loss function that diffusion actually uses: it leverages the fact above but considers the entire diffusion process.
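As a rough sketch of what this bound looks like, assume the usual notation where \(x_{0}\) is the clean image, \(x_{1:T}\) are its noised versions, \(q\) is the fixed noising process, and \(p_{\theta}\) is the learned denoising model (none of these symbols are defined in the notes above):

\begin{equation} \log p_{\theta}(x_{0}) \geq \mathbb{E}_{q(x_{1:T} \mid x_{0})} \left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T} \mid x_{0})} \right] \end{equation}

Training maximizes the right-hand side. Under the additional assumption of fixed-variance Gaussian steps, the per-step terms reduce to squared errors of the kind in the previous section.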

### LSTMs

A big flaw of LSTMs for text generation: the latent state vector has to contain information about the ENTIRE sentence, and that information has to be propagated through the recurrence. Information

### Cross Entropy

It’s MLE over a multinomial; the counts of everything that’s not the one-hot element just so happen to be 0.

We are essentially taking the derivative of the objective in:

\begin{equation} \arg\max_{p_{correct}} p_{correct} \end{equation}

which is trying to maximize the categorical probability of only the correct element (equivalently, to minimize \(-\log p_{correct}\)).
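A small numeric check of this claim, as a sketch: with a one-hot target, the full cross-entropy sum collapses to the negative log-probability of the correct class. The logits, the class index, and the helper names `softmax` and `cross_entropy` are made up for the example.

```python
import math


def softmax(logits):
    """Turn raw scores into a categorical distribution (shifted for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """General form: -sum_i target_i * log(probs_i), skipping zero-count terms."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


logits = [2.0, 0.5, -1.0]   # hypothetical model outputs for three classes
probs = softmax(logits)
correct = 0                 # assumed index of the true class
one_hot = [1.0 if i == correct else 0.0 for i in range(len(logits))]

full = cross_entropy(one_hot, probs)   # sum over all classes
shortcut = -math.log(probs[correct])   # only the correct element
print(full, shortcut)
```

The two printed values agree because every term with a zero count drops out, so lowering the loss can only be done by raising the probability of the correct element.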