Deep learning is MLE performed with neural networks. A neural network is many logistic regression pieces stacked on top of each other.

We begin motivating this by trying to solve MNIST with logistic regression. What a time to be alive. After each layer of a deep network, we use a layer of “hidden variables”, each made of a single logistic regression.

Notation:

\(x\) is the input, \(h\) is the hidden layer, and \(\hat{y}\) is the prediction.

We call each weight, at each layer, from \(x_{i}\) to \(h_{j}\), \(\theta_{i,j}^{(h)}\). At every neuron on each layer, we calculate:

\begin{equation} h_{j} = \sigma\qty[\sum_{i}^{} x_{i} \theta_{i,j}^{(h)}] \end{equation}

\begin{equation} \hat{y} = \sigma\qty[\sum_{i}^{} h_{i}\theta_{i}^{(y)}] \end{equation}
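The two equations above can be sketched in plain Python as a forward pass. This is a minimal illustration under assumptions, not a reference implementation: the function names (`sigmoid`, `forward`), the list-of-lists layout `theta_h[i][j]`, and the numeric values are all hypothetical, and \(\sigma\) is assumed to be the logistic function.

```python
import math

def sigmoid(z):
    """Logistic function, assumed to be the sigma used in the notes."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, theta_h, theta_y):
    """Forward pass for one hidden layer and a single output.

    x       : list of inputs x_i
    theta_h : theta_h[i][j] is the weight from x_i to h_j
    theta_y : theta_y[j] is the weight from h_j to y_hat
    """
    n_hidden = len(theta_y)
    # h_j = sigma(sum_i x_i * theta_h[i][j])
    h = [
        sigmoid(sum(x[i] * theta_h[i][j] for i in range(len(x))))
        for j in range(n_hidden)
    ]
    # y_hat = sigma(sum_j h_j * theta_y[j])
    y_hat = sigmoid(sum(h[j] * theta_y[j] for j in range(n_hidden)))
    return h, y_hat

# Hypothetical values, chosen only for illustration.
x = [1.0, 0.0]
theta_h = [[0.0, 1.0], [2.0, -1.0]]
theta_y = [0.0, 0.0]
h, y_hat = forward(x, theta_h, theta_y)
```

With all output weights at zero, the prediction is \(\sigma(0) = 0.5\) regardless of the hidden values, which is a quick sanity check on the wiring.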

note! we often

## backpropagation

Backpropagation is a special case of “backwards differentiation”: it computes the gradients used to update the parameters of a computation graph.

### Toy

Consider:

\begin{equation} L(a,b,c) = c(a+2b) \end{equation}

meaning we obtain a computation graph that, in three steps, computes:

- \(d = 2b\)
- \(e = a+d\)
- \(L = c\cdot e\)

To perform backpropagation, we compute derivatives from right to left: first \(\pdv{L}{L}= 1\), then, moving towards the left, \(\pdv{L}{c} = \pdv{L}{L}\pdv{L}{c}\), then \(\pdv{L}{e} = \pdv{L}{L}\pdv{L}{e}\), then \(\pdv{L}{d} = \pdv{L}{L}\pdv{L}{e}\pdv{e}{d}\), and so forth.
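As a rough sketch of the right-to-left pass for \(L(a,b,c) = c(a+2b)\), the snippet below runs the forward steps and then multiplies local derivatives backwards by hand. The function name `toy_backprop` and the input values are hypothetical, picked only for illustration.

```python
def toy_backprop(a, b, c):
    # Forward pass, left to right.
    d = 2 * b
    e = a + d
    L = c * e

    # Backward pass, right to left; each line reuses the one before it.
    dL_dL = 1.0
    dL_dc = dL_dL * e      # local derivative of c*e with respect to c is e
    dL_de = dL_dL * c      # local derivative of c*e with respect to e is c
    dL_dd = dL_de * 1.0    # local derivative of a+d with respect to d is 1
    dL_da = dL_de * 1.0    # local derivative of a+d with respect to a is 1
    dL_db = dL_dd * 2.0    # local derivative of 2b with respect to b is 2
    return L, {"a": dL_da, "b": dL_db, "c": dL_dc}

L, grads = toy_backprop(a=3.0, b=1.0, c=-2.0)
```

Each gradient is built from the one immediately to its right, so no intermediate derivative is computed twice.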

### Motivation

- deep learning is useful when it has good \(\theta\)
- we can find useful \(\theta\) by MLE
- we perform MLE by optimizing to maximize the likelihood

### Example

For one data point, let us define our neural network:

\begin{equation} h_{j} = \sigma\qty[\sum_{i}^{} x_{i} \theta_{i,j}^{(h)}] \end{equation}

\begin{equation} \hat{y} = \sigma\qty[\sum_{i}^{} h_{i}\theta_{i}^{(y)}] \end{equation}

we can then define the likelihood of that data point:

\begin{equation} L(\theta) = P(Y=y|X=x) = (\hat{y})^{y} (1-\hat{y})^{1-y} \end{equation}

because the dataset is IID, we can multiply the probabilities together:

\begin{equation} L(\theta) = \prod_{i=1}^{n} (\hat{y_{i}})^{y_{i}} (1-\hat{y_{i}})^{1-y_{i}} \end{equation}

and, to make the calculus easier and avoid numerical instability in the derivatives, we take the log:

\begin{equation} LL(\theta) = \sum_{i=1}^{n} \qty[ {y_{i}}\log (\hat{y_{i}}) + ( 1-y_{i} )\log (1-\hat{y_{i}}) ] \end{equation}
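To make the step from the product to the log concrete, here is a small sketch that computes both quantities and checks that one is the log of the other. The function names and the labels and predictions are hypothetical; the log version sums \(y_{i}\log(\hat{y_{i}})\) and \((1-y_{i})\log(1-\hat{y_{i}})\), which is what taking the log of the product yields.

```python
import math

def likelihood(y, y_hat):
    """Product of Bernoulli probabilities over an IID dataset."""
    prod = 1.0
    for yi, pi in zip(y, y_hat):
        prod *= (pi ** yi) * ((1 - pi) ** (1 - yi))
    return prod

def log_likelihood(y, y_hat):
    """Log of the product above: a sum of per-point log probabilities."""
    return sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, y_hat)
    )

# Hypothetical labels and predictions, for illustration only.
y = [1, 0, 1]
y_hat = [0.9, 0.2, 0.6]
small_l = likelihood(y, y_hat)
small_ll = log_likelihood(y, y_hat)

# With many points the raw product underflows, while the sum stays finite.
big_y = [1] * 2000
big_y_hat = [0.5] * 2000
big_l = likelihood(big_y, big_y_hat)
big_ll = log_likelihood(big_y, big_y_hat)
```

On the larger example the product collapses to zero in floating point, while the log-likelihood remains an ordinary finite number, which is one reason to optimize the sum instead.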

We want to maximise this, meaning we perform gradient ascent on this expression. Recall the chain rule; so we can break each layer down:

\begin{equation} \pdv{LL(\theta)}{\theta_{i,j}^{(h)}} = \pdv{LL(\theta)}{\hat{y}} \pdv{\hat{y}}{h_{j}} \pdv{h_{j}}{\theta_{i,j}^{(h)}} \end{equation}
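The three-factor chain rule above can be checked numerically. The sketch below, for a single data point, multiplies the three local derivatives and compares the result against a central finite difference. It assumes \(\sigma\) is the logistic function, so that \(\sigma' = \sigma(1-\sigma)\); the helper names and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, theta_h, theta_y):
    h = [sigmoid(sum(x[i] * theta_h[i][j] for i in range(len(x))))
         for j in range(len(theta_y))]
    y_hat = sigmoid(sum(h[j] * theta_y[j] for j in range(len(theta_y))))
    return h, y_hat

def log_likelihood(y, y_hat):
    return y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)

def grad_theta_h(x, y, theta_h, theta_y):
    """Gradient of LL with respect to each hidden weight, via the chain rule."""
    h, y_hat = forward(x, theta_h, theta_y)
    # First factor: dLL / dy_hat
    dLL_dyhat = y / y_hat - (1 - y) / (1 - y_hat)
    grad = [[0.0] * len(theta_y) for _ in x]
    for j in range(len(theta_y)):
        # Second factor: dy_hat / dh_j (only the j-th term of the sum survives)
        dyhat_dhj = y_hat * (1 - y_hat) * theta_y[j]
        for i in range(len(x)):
            # Third factor: dh_j / dtheta_h[i][j]
            dhj_dtheta = h[j] * (1 - h[j]) * x[i]
            grad[i][j] = dLL_dyhat * dyhat_dhj * dhj_dtheta
    return grad

def numeric_grad(x, y, theta_h, theta_y, i, j, eps=1e-6):
    """Central finite difference on a single weight, used only as a check."""
    plus = [row[:] for row in theta_h]
    minus = [row[:] for row in theta_h]
    plus[i][j] += eps
    minus[i][j] -= eps
    _, y_plus = forward(x, plus, theta_y)
    _, y_minus = forward(x, minus, theta_y)
    return (log_likelihood(y, y_plus) - log_likelihood(y, y_minus)) / (2 * eps)

# Hypothetical data point and weights, for illustration only.
x = [0.5, -1.0]
y = 1
theta_h = [[0.1, -0.3], [0.2, 0.4]]
theta_y = [0.7, -0.5]
grad = grad_theta_h(x, y, theta_h, theta_y)
```

For gradient ascent, each weight would then be nudged in the direction of its entry in `grad`; the step size is a separate choice not covered here.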

furthermore, the derivative of a sum is the sum of the derivatives:

\begin{equation} \dv{x} \sum_{i} f_{i}(x) = \sum_{i} \dv{x} f_{i}(x) \end{equation}

So we can consider our derivatives with respect to each data point separately. When computing the second factor, \(\pdv{\hat{y}}{h_{j}}\), recall an important trick:

\begin{equation} \pdv{h_{j}} \qty[\sum_{i}^{} h_{i}\theta_{i}^{(y)}] = \theta_{j}^{(y)} \end{equation}

you will note that, for the inside derivative, most of the summation vanishes: only the \(i = j\) term depends on \(h_{j}\).