backpropagation
backpropagation is a special case of “backwards differentiation” (reverse-mode differentiation) used to update the parameters of a computation graph.
constituents
- chain rule: suppose \(J=J\qty(g_{1}, \dots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1}, \dots, \theta_{p})\); then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (a small numeric check appears after this list)
- a neural network
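A small numeric sanity check of the chain rule above; the particular functions (\(J = g_{1} g_{2}\), \(g_{1}\qty(\theta) = \theta^{2}\), \(g_{2}\qty(\theta) = \sin\theta\), a single parameter) are illustrative choices, not from the note:

```python
# Chain rule: dJ/dtheta = (dJ/dg1) * dg1/dtheta + (dJ/dg2) * dg2/dtheta
#           = g2 * 2*theta + g1 * cos(theta), compared against a finite-difference estimate.
import math

def analytic_grad(t):
    g1, g2 = t ** 2, math.sin(t)
    return g2 * 2 * t + g1 * math.cos(t)

def numeric_grad(t, eps=1e-6):
    J = lambda u: (u ** 2) * math.sin(u)
    return (J(t + eps) - J(t - eps)) / (2 * eps)

print(analytic_grad(1.3), numeric_grad(1.3))  # the two values should agree closely
```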
requirements
Consider the notation in the following two-layer NN:
\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}
\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
- in a forward pass, compute each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
- in a backward pass, compute…
- \(\pdv{J}{z^{(f)}}\): by hand
- \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
- \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
- and so on… until we get to the first layer
- after obtaining all of these, we compute the gradients with respect to the weight matrices:
- \(\pdv{J}{W^{(f)}}\): lemma 1 below
- \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
- …, until we get to the first layer
chain rule lemmas
Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
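A standard statement of the three lemmas referenced above, assuming layers of the form \(z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}\), \(a^{(\ell)} = \text{ReLU}\qty(z^{(\ell)})\) with \(a^{(0)} = x\), and with \(\odot\) the elementwise product, is:
\begin{equation} \text{lemma 1:}\quad \pdv{J}{W^{(\ell)}} = \pdv{J}{z^{(\ell)}} \qty(a^{(\ell-1)})^{T}, \qquad \pdv{J}{b^{(\ell)}} = \pdv{J}{z^{(\ell)}} \end{equation}
\begin{equation} \text{lemma 2:}\quad \pdv{J}{z^{(\ell)}} = \pdv{J}{a^{(\ell)}} \odot \text{ReLU}'\qty(z^{(\ell)}) \end{equation}
\begin{equation} \text{lemma 3:}\quad \pdv{J}{a^{(\ell-1)}} = \qty(W^{(\ell)})^{T} \pdv{J}{z^{(\ell)}} \end{equation}
A minimal numpy sketch of the forward and backward pass for the two-layer network defined earlier, under these lemma forms (column-vector shapes; names like `W1` and `dJ_dh` are illustrative):

```python
# Forward and backward pass for: z = W1 x + b1, a = ReLU(z), h = W2 a + b2,
# J = 1/2 * ||y - h||^2. All inputs are numpy column vectors / matrices.
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # forward pass: cache every intermediate value
    z = W1 @ x + b1                  # first linear layer
    a = np.maximum(z, 0)             # ReLU activation
    h = W2 @ a + b2                  # output layer, h_theta(x)
    J = 0.5 * np.sum((y - h) ** 2)   # squared-error loss

    # backward pass: from the last layer to the first
    dJ_dh = h - y                    # by hand: derivative of the loss w.r.t. the output
    dJ_da = W2.T @ dJ_dh             # lemma 3: back through the linear map W2
    dJ_dz = dJ_da * (z > 0)          # lemma 2: back through ReLU (derivative is 1{z > 0})
    dJ_dW2 = dJ_dh @ a.T             # lemma 1: gradient for the second weight matrix
    dJ_db2 = dJ_dh
    dJ_dW1 = dJ_dz @ x.T             # lemma 1 again, one layer down
    dJ_db1 = dJ_dz
    return J, (dJ_dW1, dJ_db1, dJ_dW2, dJ_db2)
```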
deep learning
supervised learning with non-linear models.
Motivation
Previously, our learning method was linear in the parameters \(\theta\) (i.e., we can use non-linear features of \(x\), but the hypothesis is always linear in \(\theta\)). Today: with deep learning we can have non-linearity in both \(\theta\) and \(x\).
constituents
- We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
- Learning rate: \(\alpha\)
- Batch size \(B\)
- Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a dataset point per update) or batch gradient descent (where we compute the gradient over a batch of size \(B\), scaling the learning rate by the batch size); see the training-loop sketch after this list
- neural network
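A minimal sketch of the resulting training loop; `grad_J` (returning the gradient of the cost averaged over a batch) is an assumed helper not defined in the note, and the default hyperparameter values are placeholders:

```python
# Mini-batch gradient descent: sample B indices, take a step theta <- theta - alpha * grad.
# With B = 1 this reduces to stochastic gradient descent.
import numpy as np

def train(X, Y, theta, grad_J, alpha=1e-2, B=32, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(n_iter):
        idx = rng.choice(n, size=B, replace=False)              # sample a batch of B examples
        theta = theta - alpha * grad_J(theta, X[idx], Y[idx])   # gradient step on the batch
    return theta
```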
requirements
additional information
Background
Notation:
Gaussian mixture model
Gaussian mixture model is a density estimation technique, which is useful for detecting out-of-distribution samples, etc.
We model the data as a superposition of a group of Gaussian distributions that together explain the dataset.
Suppose the data was generated from a mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent \(z^{(i)}\) which tells you which Gaussian your data point is generated from.
So, for \(k\) Gaussians in your mixture:
\(z^{(i)} \in \qty {1, \dots, k}\) such that \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\) (such that \(\phi_{j} \geq 0\), \(\sum_{j}^{} \phi_{j} = 1\))
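A small sketch of this generative story: draw \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\), then draw \(x^{(i)}\) from the Gaussian selected by \(z^{(i)}\) (the usual convention \(x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}\qty(\mu_{j}, \Sigma_{j})\)); the specific \(\phi\), means, and covariances below are illustrative assumptions:

```python
# Sample n points from a k = 3 component Gaussian mixture in 2D.
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.5, 0.3, 0.2])                           # mixing weights: phi_j >= 0, sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])   # one mean per Gaussian
covs = [np.eye(2), np.eye(2), 0.5 * np.eye(2)]            # one covariance per Gaussian

def sample(n):
    z = rng.choice(len(phi), size=n, p=phi)               # latent component assignment z^(i)
    x = np.array([rng.multivariate_normal(means[j], covs[j]) for j in z])
    return x, z
```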
matrix calculus
Transpose Rules
- \(\qty(AB)^{T} = B^{T}A^{T}\)
- \(\qty(a^{T}Bc)^{T} = c^{T} B^{T}a\)
- \(a^{T}b = b^{T}a\)
- \(\qty(A+B)C = AC + BC\)
- \(\qty(a+b)^{T}C = a^{T}C + b^{T}C\)
- \(AB \neq BA\)
Derivative
| Scalar derivative | Vector derivative |
|---|---|
| \(f\qty(x) \to \pdv{f}{x}\) | \(f\qty(x) \to \pdv{f}{x}\) |
| \(bx \to b\) | \(x^{T}B \to B\) |
| \(bx \to b\) | \(x^{T}b \to b\) |
| \(x^{2} \to 2x\) | \(x^{T}x \to 2x\) |
| \(bx^{2} \to 2bx\) | \(x^{T}Bx \to 2Bx\) (for symmetric \(B\)) |
Products
\begin{equation} \pdv{AB}{A} = B^{T}, \pdv{AB}{B} = A^{T} \end{equation}
\begin{equation} \pdv{Ax}{A} = x^{T}, \pdv{Ax}{x}= A \end{equation}
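A quick numeric spot-check of two of the rules above with random matrices (purely illustrative; the symmetrized \(S\) reflects the symmetric-\(B\) caveat on the \(x^{T}Bx\) row):

```python
# Check (AB)^T = B^T A^T, and that the gradient of f(x) = x^T S x is 2 S x for symmetric S.
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
assert np.allclose((A @ B).T, B.T @ A.T)               # transpose rule

C = rng.normal(size=(4, 4))
S = (C + C.T) / 2                                      # make a symmetric matrix
x = rng.normal(size=4)
f = lambda v: v @ S @ v                                # f(x) = x^T S x
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
assert np.allclose(numeric, 2 * S @ x, atol=1e-5)      # matches the table entry 2Bx
```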
Normal Equation
constituents
Let’s also define our entire training examples and stack them in rows:
\begin{equation} X = \mqty( - {x^{(1)}}^{T} - \\ \vdots \\ - {x^{(n)}}^{T} - ) \end{equation}
\begin{equation} Y = \mqty(y^{(1)} \\ \vdots \\ y^{(n)}) \end{equation}
requirements
least-squares error becomes:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n} \qty(h\qty(x^{(i)}) - y^{(i)})^{2} = \frac{1}{2}\qty(X \theta - y)^{T} \qty(X \theta - y) \end{equation}
Solving this exactly by taking the derivative of \(J\) and setting it to \(0\) (i.e., at a minimum), we obtain
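\begin{equation} \nabla_{\theta} J\qty(\theta) = X^{T}\qty(X\theta - y) = 0 \implies \theta = \qty(X^{T}X)^{-1} X^{T} y \end{equation}
assuming \(X^{T}X\) is invertible. A short numpy sketch checking this closed form against a least-squares solver; the data below is synthetic and purely illustrative:

```python
# Compare the normal-equation solution with np.linalg.lstsq on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_theta ||X theta - y||^2
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)      # theta = (X^T X)^{-1} X^T y
assert np.allclose(theta_lstsq, theta_normal)
```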
