expectation maximization

Sorta like “distribution-based k-means clustering”

constituents

A Gaussian mixture model!

requirements

Two steps:

e-step

“guess the value of \(z^{(i)}\)”

\begin{align} w_{j}^{(i)} &= p\qty(z^{(i)} = j | x^{(i)}; \phi, \mu, \Sigma) \\ &= \frac{p\qty(x^{(i)} | z^{(i)}=j) p\qty(z^{(i)}=j)}{\sum_{l=1}^{k}p\qty(x^{(i)} | z^{(i)}=l) p\qty(z^{(i)}=l)} \end{align}

Where we have:

  • \(p\qty(x^{(i)} |z^{(i)}=j)\) from the Gaussian distribution, where we have \(\Sigma_{j}\) and \(\mu_{j}\) for the parameters of our Gaussian \(j\).
  • \(p\qty(z^{(i)} =j)\) is just \(\phi_{j}\) which we are learning

These weights \(w_{j}^{(i)}\) are how much the model believes point \(x^{(i)}\) belongs to each cluster \(j\).
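A minimal e-step sketch in Python (numpy/scipy assumed; the names e_step, phi, mu, sigma and the array shapes are my own, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, sigma):
    """Compute responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, sigma).

    X: (n, d) data; phi: (k,) mixing weights;
    mu: (k, d) means; sigma: (k, d, d) covariances.
    """
    n, k = X.shape[0], phi.shape[0]
    w = np.zeros((n, k))
    for j in range(k):
        # numerator: p(x | z = j) * p(z = j), evaluated for every point at once
        w[:, j] = phi[j] * multivariate_normal(mu[j], sigma[j]).pdf(X)
    # denominator: sum over l of p(x | z = l) p(z = l), i.e. normalize each row
    w /= w.sum(axis=1, keepdims=True)
    return w
```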

m-step

“Update parameters using our guesses”

\begin{equation} \phi_{j} = \frac{1}{n} \sum_{i=1}^{n} w_{j}^{(i)} \end{equation}

\begin{equation} \mu_{j} = \frac{\sum_{i=1}^{n} w_{j}^{(i)}x^{(i)}}{ \sum_{i=1}^{n} w_{j}^{(i)}} \end{equation}
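The notes stop at \(\phi_{j}\) and \(\mu_{j}\); the covariance gets the analogous weighted update (this is the standard GMM m-step equation, filled in here rather than taken from the notes above):

\begin{equation} \Sigma_{j} = \frac{\sum_{i=1}^{n} w_{j}^{(i)} \qty(x^{(i)} - \mu_{j}) \qty(x^{(i)} - \mu_{j})^{\top}}{\sum_{i=1}^{n} w_{j}^{(i)}} \end{equation}

A matching m-step sketch (same assumed shapes and names as the e-step sketch above):

```python
import numpy as np

def m_step(X, w):
    """Re-estimate (phi, mu, sigma) from responsibilities w of shape (n, k)."""
    n, d = X.shape
    k = w.shape[1]
    nk = w.sum(axis=0)                    # effective number of points per cluster
    phi = nk / n                          # phi_j = (1/n) sum_i w_ij
    mu = (w.T @ X) / nk[:, None]          # weighted means
    sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu[j]                  # (n, d) deviations from cluster mean
        sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j]
    return phi, mu, sigma
```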

additional information

justification

We have a probability distribution:

\begin{equation} p\qty(x,z) = p\qty(x|z)p\qty(z) \end{equation}

such that:

\begin{equation} p\qty(x) = \sum_{j=1}^{k}p\qty(x|z=j) p\qty(z=j) \end{equation}
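Plugging in the pieces from the e-step section (\(p\qty(z=j) = \phi_{j}\) and a Gaussian for \(p\qty(x|z=j)\)), this marginal is just the mixture density:

\begin{equation} p\qty(x) = \sum_{j=1}^{k} \phi_{j}\, \mathcal{N}\qty(x; \mu_{j}, \Sigma_{j}) \end{equation}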

Our goal is to choose model parameters \(\phi, \mu, \Sigma\) such that we maximize:

\begin{equation} \max_{\phi, \mu, \Sigma} \prod_{i=1}^{n} p\qty(x^{(i)} ; \phi, \mu, \Sigma) \end{equation}

We observe only \(x^{(i)}\) (the \(z^{(i)}\) are latent), and for parameters \(\theta = \qty(\phi, \mu, \Sigma)\), we want to maximize:

\begin{equation} l\qty(\theta) = \sum_{i=1}^{n} \log p\qty(x^{(i)}; \theta) = \sum_{i}^{} \log \sum_{z^{(i)}}^{} p\qty(x^{(i)}, z^{(i)}; \theta) \end{equation}

where the second equality holds by the law of total probability (marginalizing the joint over \(z^{(i)}\)).

The EM algorithm constructs a lower bound (E), and then moves parameters to that new lower bound (M).

Now, consider:

\begin{align} l\qty(\theta) &= \sum_{i=1}^{n} \log p\qty(x^{(i)}; \theta) = \sum_{i}^{} \log \sum_{z^{(i)}}^{} p\qty(x^{(i)}, z^{(i)}; \theta) \\ &= \sum_{i}^{} \log \sum_{z^{(i)}}^{} Q_{i}\qty(z^{(i)}) \cdot \qty [ \frac{p\qty(x^{(i)}, z^{(i)}; \theta)}{Q_{i}\qty(z^{(i)})}] \end{align}

for some multinomial distribution \(Q_{i}\) over \(z^{(i)}\) (we multiplied and divided by \(Q_{i}\qty(z^{(i)})\) inside the sum). This obtains:

[lecture ran out of time]
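The standard continuation (filling in what the lecture did not reach): the inner sum is an expectation over \(Q_{i}\), and \(\log\) is concave, so the concave form of Jensen's inequality below gives

\begin{equation} l\qty(\theta) \geq \sum_{i=1}^{n} \sum_{z^{(i)}}^{} Q_{i}\qty(z^{(i)}) \log \qty[ \frac{p\qty(x^{(i)}, z^{(i)}; \theta)}{Q_{i}\qty(z^{(i)})}] \end{equation}

with equality when the bracketed ratio is constant in \(z^{(i)}\), i.e. when \(Q_{i}\qty(z^{(i)}) = p\qty(z^{(i)} | x^{(i)}; \theta)\), which is exactly the e-step weights \(w_{j}^{(i)}\).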

E step: eventually this gives a lower bound on \(l\); then the M step chooses parameters to maximize this lower bound.
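Put together, one EM iteration is an e-step followed by an m-step, repeated until the likelihood stops improving. A minimal driver using the e_step/m_step sketches above (the initialization scheme and iteration count here are arbitrary assumptions):

```python
import numpy as np

def fit_gmm(X, k, n_iters=100):
    """Run EM for a k-component Gaussian mixture on data X of shape (n, d)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    phi = np.full(k, 1.0 / k)                      # uniform mixing weights
    mu = X[rng.choice(n, size=k, replace=False)]   # k random points as initial means
    sigma = np.stack([np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        w = e_step(X, phi, mu, sigma)   # E: guess z^(i) via responsibilities
        phi, mu, sigma = m_step(X, w)   # M: refit parameters to those guesses
    return phi, mu, sigma
```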

Jensen’s Inequality

Let \(f\) be a convex function; that is, \(f''\qty(x) \geq 0\); let \(x\) be a random variable. Then, \(f\qty(\mathbb{E}[x]) \leq \mathbb{E}\qty [f\qty(x)]\).

Further, if \(f\) is strictly convex, that is \(f''\qty(x) > 0\), then \(\mathbb{E}\qty [f\qty(x)] = f\qty(\mathbb{E}[x])\) holds if and only if \(x\) is constant (with probability 1).

  • concave edition

    Let \(f\) be a concave function; that is, \(f''\qty(x) \leq 0\); let \(x\) be a random variable. Then, \(f\qty(\mathbb{E}[x]) \geq \mathbb{E}\qty [f\qty(x)]\).

    Further, if \(f\) is strictly concave, that is \(f''\qty(x) < 0\), then \(\mathbb{E}\qty [f\qty(x)] = f\qty(\mathbb{E}[x])\) holds if and only if \(x\) is constant (with probability 1).
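A quick sanity check of the convex case: take \(f\qty(x) = x^{2}\) and let \(x\) be \(\pm 1\) with probability \(\frac{1}{2}\) each; then \(f\qty(\mathbb{E}[x]) = 0 \leq \mathbb{E}\qty [f\qty(x)] = 1\), and the inequality is strict because \(x\) is not constant.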

failure mode

The EM algorithm has a failure mode whereby it puts the weight of one Gaussian WAAY on top of one point (i.e. infinite log likelihood as that Gaussian's covariance collapses), and spreads out the predictions everywhere else.

In that case we give up and try again.

additional information

For a Gaussian mixture model: if we knew \(z^{(i)}\), we could estimate the parameters using Maximum Likelihood Parameter Learning:

\begin{equation} \phi_{j} = \frac{1}{n} \sum_{i=1}^{n} \mathbbm{1}\qty{z^{(i)} = j} \end{equation}

\begin{equation} \mu_{j} = \frac{\sum_{i=1}^{n} \mathbbm{1} \qty {z^{(i)} = j} x^{(i)}}{\sum_{i=1}^{n} \mathbbm{1} \qty {z^{(i)} = j} } \end{equation}
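A minimal sketch of these complete-data estimates (numpy assumed; the hard indicators \(\mathbbm{1}\qty{z^{(i)} = j}\) play the role that the soft weights \(w_{j}^{(i)}\) play in the m-step above):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Estimate phi_j and mu_j when the cluster labels z are known.

    X: (n, d) data; z: (n,) integer labels in {0, ..., k-1}.
    Assumes every cluster has at least one point.
    """
    phi = np.zeros(k)
    mu = np.zeros((k, X.shape[1]))
    for j in range(k):
        mask = (z == j)               # indicator 1{z_i = j} for every i
        phi[j] = mask.mean()          # fraction of points assigned to cluster j
        mu[j] = X[mask].mean(axis=0)  # average of the points in cluster j
    return phi, mu
```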