Gaussian Discriminant Analysis
High level idea: 1) fit the parameters of a multivariate Gaussian density to the positive and negative examples separately; 2) classify a new sample by checking which class's density (weighted by its prior) assigns it the higher probability.
requirements
- fit the parameters \(\phi, \mu_{0}, \mu_{1}, \Sigma\) to labeled training data (see fitting below)
additional information
multivariate gaussian
See multivariate Gaussian density. If it helps, here you go:
\begin{equation} p\qty(x) = \frac{1}{\qty(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \qty(-\frac{1}{2} \qty(x-\mu)^{T} \Sigma^{-1} \qty(x - \mu)) \end{equation}
where \(d\) is the dimension of \(x\).
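A minimal numpy sketch of evaluating this density; the function name gaussian_density is my own choice, and scipy.stats.multivariate_normal.pdf computes the same thing:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate Gaussian density at a length-d vector x."""
    d = x.shape[0]
    diff = x - mu
    # normalizing constant: (2*pi)^(d/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    # quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * quad) / norm
```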
making predictions
Suppose \(p\qty(y=1) = \phi\), \(p\qty(y=0) = 1-\phi\).
Now, this means that:
\begin{equation} p\qty(y) = \phi^{y} \qty(1-\phi)^{1-y} \end{equation}
Now, we can write using the multivariate Gaussian density that:
\begin{equation} p\qty(x\mid y = 0) = \frac{1}{\qty(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \qty(-\frac{1}{2} \qty(x-\mu_{0})^{T} \Sigma^{-1} \qty(x - \mu_{0})) \end{equation}
\begin{equation} p\qty(x\mid y = 1) = \frac{1}{\qty(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \qty(-\frac{1}{2} \qty(x-\mu_{1})^{T} \Sigma^{-1} \qty(x - \mu_{1})) \end{equation}
Finally, to predict \(p\qty(\cdot \mid x)\), we apply Bayes' rule using the \(p\qty(y)\) and \(p\qty(x|y)\) above. In particular we can have:
\begin{align} y &= \arg\max_{y} p\qty(y|x) \\ &= \arg\max_{y} \frac{p\qty(x|y)p\qty(y)}{p\qty(x)} \\ &= \arg\max_{y} p\qty(x|y)p\qty(y) \end{align}
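A minimal sketch of this argmax in numpy, reusing the gaussian_density sketch above (the function name predict and the 0/1 return convention are my own):

```python
def predict(x, phi, mu0, mu1, sigma):
    """Pick the class maximizing p(x|y) p(y); p(x) cancels out of the argmax."""
    # class-conditional likelihoods under the shared covariance, weighted by priors
    score0 = gaussian_density(x, mu0, sigma) * (1 - phi)  # p(x|y=0) p(y=0)
    score1 = gaussian_density(x, mu1, sigma) * phi        # p(x|y=1) p(y=1)
    return int(score1 > score0)
```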
applying Bayes rule to make predictions yields the sigmoid function
Fun fact: applying Bayes' rule to make predictions will just hand you the sigmoid function; i.e. the predicted \(p\qty(y=1|x)\) is going to be a sigmoid of a linear function of \(x\). That is, assuming:
\begin{equation} \begin{cases} x|y=0 \sim \mathcal{N}\qty(\mu_{0}, \Sigma) \\ x|y=1 \sim \mathcal{N}\qty(\mu_{1}, \Sigma) \\ y \sim \text{Ber}\qty(\phi) \end{cases} \end{equation}
implies that
\begin{equation} p\qty(y=1|x) \end{equation}
is logistic. Meaning, GDA makes stronger assumptions than logistic regression (the converse does not hold: a logistic posterior does not imply Gaussian class-conditionals).
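A sketch of why, under the Gaussian assumptions above (the names \(\theta, \theta_{0}\) are mine): Bayes' rule gives
\begin{equation} p\qty(y=1|x) = \frac{p\qty(x|y=1)\phi}{p\qty(x|y=1)\phi + p\qty(x|y=0)\qty(1-\phi)} = \frac{1}{1 + \exp \qty(-\qty(\theta^{T}x + \theta_{0}))} \end{equation}
where, because the shared \(\Sigma\) makes the quadratic terms in \(x\) cancel,
\begin{equation} \theta = \Sigma^{-1}\qty(\mu_{1} - \mu_{0}), \qquad \theta_{0} = -\frac{1}{2}\qty(\mu_{1}^{T}\Sigma^{-1}\mu_{1} - \mu_{0}^{T}\Sigma^{-1}\mu_{0}) + \log \frac{\phi}{1-\phi} \end{equation}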
double fun fact:
\begin{equation} \begin{cases} x|y=0 \sim \text{ExFam}\qty(\eta_{0}) \\ x|y=1 \sim \text{ExFam}\qty(\eta_{1}) \\ y \sim \text{Ber}\qty(\phi) \end{cases} \end{equation}
will also make \(p\qty(y=1|x)\) logistic: any pair of class-conditionals from the same exponential family (differing only in natural parameter) works, not just Gaussians.
So why are we doing any of this?!
If you have a large dataset with lots of noise, logistic regression will generally do better, since it makes weaker assumptions. If you know your class-conditional data really is Gaussian, GDA is more data-efficient and fits well with fewer samples (i.e. good for data-constrained regimes).
fitting
Our goal is to solve for \(\phi, \mu_{0}, \mu_{1}, \Sigma\) used in the predictions above by maximizing the joint likelihood of our data.
\begin{equation} \mathcal{L}\qty(\phi, \mu_{0}, \mu_{1}, \Sigma) = \prod_{i=1}^{n} p\qty(x^{(i)}, y^{(i)}; \phi, \mu_{0}, \mu_{1}, \Sigma) \end{equation}
in particular:
\begin{equation} \mathcal{L}\qty(\phi, \mu_{0}, \mu_{1}, \Sigma) = \prod_{i=1}^{n} p\qty(x^{(i)}|y^{(i)}) p\qty(y^{(i)}) \end{equation}
And finally, applying the log trick, you can find the parameters \(\phi, \mu_{0}, \mu_{1}, \Sigma\) that achieve:
\begin{equation} \max_{\phi, \mu_{0}, \mu_{1}, \Sigma} \sum_{i=1}^{n} \log \qty[p\qty(x^{(i)}|y^{(i)}) p\qty(y^{(i)})] \end{equation}
If you do the derivative thing, let the algebra go brrr, and solve, you obtain:
\begin{equation} \phi = \frac{\sum_{i=1}^{n}y^{(i)}}{n} = \frac{\sum_{i=1}^{n} 1\qty {y^{(i)}=1}}{n} \end{equation}
and the means are just the mean of all samples in each class
\begin{equation} \mu_{0} = \frac{\sum_{i=1}^{n} 1 \qty {y^{(i)}=0} x^{(i)}}{\sum_{i=1}^{n}1 \qty {y^{(i)}=0}} \end{equation}
\begin{equation} \mu_{1} = \frac{\sum_{i=1}^{n} 1 \qty {y^{(i)}=1} x^{(i)}}{\sum_{i=1}^{n}1 \qty {y^{(i)}=1}} \end{equation}
and the covariance is a function of these means:
\begin{equation} \Sigma = \frac{1}{n} \sum_{i=1}^{n} \qty(x^{(i)}-\mu_{y^{(i)}}) \qty(x^{(i)}- \mu_{y^{(i)}})^{T} \end{equation}
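A minimal numpy sketch of these closed-form estimates (the function name fit_gda, and that X is an \(n \times d\) array with y a length-\(n\) vector of 0/1 labels, are my assumptions):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for GDA."""
    n = X.shape[0]
    phi = y.mean()                    # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)      # mean of class-0 samples
    mu1 = X[y == 1].mean(axis=0)      # mean of class-1 samples
    # shared covariance: average outer product of each sample's residual
    # from its own class mean
    diffs = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = diffs.T @ diffs / n
    return phi, mu0, mu1, sigma
```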
“why do we have a single covariance for all classes?”
For #reasons (sketched below), a single shared covariance matrix results in a linear decision boundary; a custom covariance for each class would result in a non-linear (quadratic) boundary.
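A sketch of the #reasons: with a shared \(\Sigma\), the \(-\frac{1}{2}x^{T}\Sigma^{-1}x\) terms cancel in the log odds (as in the sigmoid derivation above), leaving something linear in \(x\), so the boundary where \(p\qty(y=1|x) = p\qty(y=0|x)\) is a hyperplane. With per-class covariances \(\Sigma_{0} \neq \Sigma_{1}\), the quadratic terms survive:
\begin{equation} \log \frac{p\qty(x|y=1)p\qty(y=1)}{p\qty(x|y=0)p\qty(y=0)} = -\frac{1}{2} x^{T}\qty(\Sigma_{1}^{-1} - \Sigma_{0}^{-1}) x + \qty(\Sigma_{1}^{-1}\mu_{1} - \Sigma_{0}^{-1}\mu_{0})^{T} x + \text{const} \end{equation}
and the decision boundary becomes quadratic in \(x\).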