expectation maximization

Sorta like “distribution-based k-means clustering”

constituents

A Gaussian mixture model!

requirements

Two steps:

e-step

“guess the value of \(z^{(i)}\)”

\begin{align} w_{j}^{(i)} &= p\qty(z^{(i)} = j | x^{(i)}; \phi, \mu, \Sigma) \\ &= \frac{p\qty(x^{(i)} | z^{(i)}=j) p\qty(z^{(i)}=j)}{\sum_{l=1}^{k}p\qty(x^{(i)} | z^{(i)}=l) p\qty(z^{(i)}=l)} \end{align}

Where we have:

  • \(p\qty(x^{(i)} |z^{(i)}=j)\) from the Gaussian distribution, where we have \(\Sigma_{j}\) and \(\mu_{j}\) for the parameters of our Gaussian \(j\).
  • \(p\qty(z^{(i)} =j)\) is just \(\phi_{j}\) which we are learning

These weights \(w_{j}^{(i)}\) are how much the model believes point \(x^{(i)}\) belongs to each cluster \(j\).
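A minimal e-step sketch in Python (numpy/scipy assumed; the names e_step, phi, mu, sigma and the array shapes are my own, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, sigma):
    """Compute responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, sigma).

    X: (n, d) data; phi: (k,) mixing weights;
    mu: (k, d) means; sigma: (k, d, d) covariances.
    """
    n, k = X.shape[0], phi.shape[0]
    w = np.zeros((n, k))
    for j in range(k):
        # numerator: p(x | z = j) * p(z = j), evaluated for every point at once
        w[:, j] = phi[j] * multivariate_normal(mu[j], sigma[j]).pdf(X)
    # denominator: sum over l of p(x | z = l) p(z = l), i.e. normalize each row
    w /= w.sum(axis=1, keepdims=True)
    return w
```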

m-step

“Update parameters using our guesses”

\begin{equation} \phi_{j} = \frac{1}{n} \sum_{i=1}^{n} w_{j}^{(i)} \end{equation}

\begin{equation} \mu_{j} = \frac{\sum_{i=1}^{n} w_{j}^{(i)}x^{(i)}}{ \sum_{i=1}^{n} w_{j}^{(i)}} \end{equation}
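The notes stop at \(\phi_{j}\) and \(\mu_{j}\); the covariance gets the analogous weighted update (this is the standard GMM m-step equation, filled in here rather than taken from the notes above):

\begin{equation} \Sigma_{j} = \frac{\sum_{i=1}^{n} w_{j}^{(i)} \qty(x^{(i)} - \mu_{j}) \qty(x^{(i)} - \mu_{j})^{\top}}{\sum_{i=1}^{n} w_{j}^{(i)}} \end{equation}

A matching m-step sketch (same assumed shapes and names as the e-step sketch above):

```python
import numpy as np

def m_step(X, w):
    """Re-estimate (phi, mu, sigma) from responsibilities w of shape (n, k)."""
    n, d = X.shape
    k = w.shape[1]
    nk = w.sum(axis=0)                    # effective number of points per cluster
    phi = nk / n                          # phi_j = (1/n) sum_i w_ij
    mu = (w.T @ X) / nk[:, None]          # weighted means
    sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu[j]                  # (n, d) deviations from cluster mean
        sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j]
    return phi, mu, sigma
```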

additional information

justification

We have a probability distribution:

\begin{equation} p\qty(x,z) = p\qty(x|z)p\qty(z) \end{equation}

such that:

\begin{equation} p\qty(x) = \sum_{j=1}^{k}p\qty(x|z=j) p\qty(z=j) \end{equation}
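Plugging in the pieces from the e-step section (\(p\qty(z=j) = \phi_{j}\) and a Gaussian for \(p\qty(x|z=j)\)), this marginal is just the mixture density:

\begin{equation} p\qty(x) = \sum_{j=1}^{k} \phi_{j}\, \mathcal{N}\qty(x; \mu_{j}, \Sigma_{j}) \end{equation}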

Our goal is to choose model parameters \(\phi, \mu, \Sigma\) such that we maximize:

\begin{equation} \max_{\phi, \mu, \Sigma} \prod_{i=1}^{n} p\qty(x^{(i)} ; \phi, \mu, \Sigma) \end{equation}

We observe only \(x^{(i)}\) (the \(z^{(i)}\) are latent), and for parameters \(\theta = \qty(\phi, \mu, \Sigma)\), we want to maximize:

\begin{equation} l\qty(\theta) = \sum_{i=1}^{n} \log p\qty(x^{(i)}; \theta) = \sum_{i}^{} \log \sum_{z^{(i)}}^{} p\qty(x^{(i)}, z^{(i)}; \theta) \end{equation}

where the second equality holds by the law of total probability (marginalizing the joint over \(z^{(i)}\)).

The EM algorithm constructs a lower bound (E), and then moves parameters to that new lower bound (M).

Now, consider:

\begin{align} l\qty(\theta) &= \sum_{i=1}^{n} \log p\qty(x^{(i)}; \theta) = \sum_{i}^{} \log \sum_{z^{(i)}}^{} p\qty(x^{(i)}, z^{(i)}; \theta) \\ &= \sum_{i}^{} \log \sum_{z^{(i)}}^{} Q_{i}\qty(z^{(i)}) \cdot \qty [ \frac{p\qty(x^{(i)}, z^{(i)}; \theta)}{Q_{i}\qty(z^{(i)})}] \end{align}

for some multinomial distribution \(Q_{i}\) over \(z^{(i)}\) (we multiplied and divided by \(Q_{i}\qty(z^{(i)})\) inside the sum). This obtains:

[lecture ran out of time]
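The standard continuation (filling in what the lecture did not reach): the inner sum is an expectation over \(Q_{i}\), and \(\log\) is concave, so the concave form of Jensen's inequality below gives

\begin{equation} l\qty(\theta) \geq \sum_{i=1}^{n} \sum_{z^{(i)}}^{} Q_{i}\qty(z^{(i)}) \log \qty[ \frac{p\qty(x^{(i)}, z^{(i)}; \theta)}{Q_{i}\qty(z^{(i)})}] \end{equation}

with equality when the bracketed ratio is constant in \(z^{(i)}\), i.e. when \(Q_{i}\qty(z^{(i)}) = p\qty(z^{(i)} | x^{(i)}; \theta)\), which is exactly the e-step weights \(w_{j}^{(i)}\).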

E step: eventually this gives a lower bound on \(l\); then the M step chooses parameters to maximize this lower bound.
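Put together, one EM iteration is an e-step followed by an m-step, repeated until the likelihood stops improving. A minimal driver using the e_step/m_step sketches above (the initialization scheme and iteration count here are arbitrary assumptions):

```python
import numpy as np

def fit_gmm(X, k, n_iters=100):
    """Run EM for a k-component Gaussian mixture on data X of shape (n, d)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    phi = np.full(k, 1.0 / k)                      # uniform mixing weights
    mu = X[rng.choice(n, size=k, replace=False)]   # k random points as initial means
    sigma = np.stack([np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        w = e_step(X, phi, mu, sigma)   # E: guess z^(i) via responsibilities
        phi, mu, sigma = m_step(X, w)   # M: refit parameters to those guesses
    return phi, mu, sigma
```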

Jensen’s Inequality

Let \(f\) be a convex function; that is, \(f''\qty(x) \geq 0\); let \(x\) be a random variable. Then, \(f\qty(\mathbb{E}[x]) \leq \mathbb{E}\qty [f\qty(x)]\).

Further, if \(f\) is strictly convex, that is \(f''\qty(x) > 0\), then \(\mathbb{E}\qty [f\qty(x)] = f\qty(\mathbb{E}[x])\) holds if and only if \(x\) is constant (with probability 1).

  • concave edition

    Let \(f\) be a concave function; that is, \(f''\qty(x) \leq 0\); let \(x\) be a random variable. Then, \(f\qty(\mathbb{E}[x]) \geq \mathbb{E}\qty [f\qty(x)]\).

    Further, if \(f\) is strictly concave, that is \(f''\qty(x) < 0\), then \(\mathbb{E}\qty [f\qty(x)] = f\qty(\mathbb{E}[x])\) holds if and only if \(x\) is constant (with probability 1).
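A quick sanity check of the convex case: take \(f\qty(x) = x^{2}\) and let \(x\) be \(\pm 1\) with probability \(\frac{1}{2}\) each; then \(f\qty(\mathbb{E}[x]) = 0 \leq \mathbb{E}\qty [f\qty(x)] = 1\), and the inequality is strict because \(x\) is not constant.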

failure mode

The EM algorithm has a failure mode whereby it puts the weight of one Gaussian WAAY on top of one point (i.e. infinite log likelihood as that Gaussian's covariance collapses), and spreads out the predictions everywhere else.

In that case we give up and try again.

additional information

For a Gaussian mixture model: if we knew \(z^{(i)}\), we could estimate the parameters using Maximum Likelihood Parameter Learning:

\begin{equation} \phi_{j} = \frac{1}{n} \sum_{i=1}^{n} \mathbbm{1}\qty{z^{(i)} = j} \end{equation}

\begin{equation} \mu_{j} = \frac{\sum_{i=1}^{n} \mathbbm{1} \qty {z^{(i)} = j} x^{(i)}}{\sum_{i=1}^{n} \mathbbm{1} \qty {z^{(i)} = j} } \end{equation}
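A minimal sketch of these complete-data estimates (numpy assumed; the hard indicators \(\mathbbm{1}\qty{z^{(i)} = j}\) play the role that the soft weights \(w_{j}^{(i)}\) play in the m-step above):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Estimate phi_j and mu_j when the cluster labels z are known.

    X: (n, d) data; z: (n,) integer labels in {0, ..., k-1}.
    Assumes every cluster has at least one point.
    """
    phi = np.zeros(k)
    mu = np.zeros((k, X.shape[1]))
    for j in range(k):
        mask = (z == j)               # indicator 1{z_i = j} for every i
        phi[j] = mask.mean()          # fraction of points assigned to cluster j
        mu[j] = X[mask].mean(axis=0)  # average of the points in cluster j
    return phi, mu
```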