deep learning
Supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). Today: with deep learning we allow non-linearity in both \(\theta\) and \(x\).
constituents
- We have the dataset \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\)
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
- Learning rate: \(\alpha\)
- Batch size \(B\)
- Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a single data point per update) or batch gradient descent (where we compute the gradient over a batch of size \(B\) and scale the learning rate with the batch size); a sketch of this update loop follows the list
- neural network
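Below is a minimal numpy sketch of the optimization loop above, using the linear hypothesis \(h_{\theta}\qty(x) = \theta^{T}x\) as a placeholder; the function name, default hyperparameter values, and hypothesis choice are illustrative assumptions rather than anything fixed by these notes.

```python
import numpy as np

def sgd(X, y, alpha=0.01, B=32, n_iter=1000):
    """Mini-batch gradient descent on J(theta) = (1/n) sum_i (y_i - h_theta(x_i))^2,
    with the placeholder linear hypothesis h_theta(x) = theta^T x."""
    n, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(n, size=min(B, n), replace=False)  # randomly sample a batch
        Xb, yb = X[idx], y[idx]
        residual = Xb @ theta - yb             # h_theta(x) - y on the batch
        grad = 2 * Xb.T @ residual / len(idx)  # gradient of the batch-averaged loss
        theta -= alpha * grad                  # theta <- theta - alpha * grad_theta J
    return theta
```

Usage would be something like `theta = sgd(X, y)` with `X` of shape `(n, d)` and `y` of shape `(n,)`.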
requirements
additional information
Background
Notation:
Gaussian mixture model
A Gaussian mixture model is a density estimation technique, useful (among other things) for detecting out-of-distribution samples.
We model the data as a superposition of a group of Gaussian distributions that together explain the dataset.
Suppose the data was generated from a mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent variable \(z^{(i)}\) that tells you which Gaussian the data point was generated from.
So, for \(k\) Gaussians in the mixture:
\(z^{(i)} \in \qty {1, \dots, k}\) such that \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\) (where \(\phi_{j} \geq 0\), \(\sum_{j}^{} \phi_{j} = 1\)), and \(x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}\qty(\mu_{j}, \Sigma_{j})\).
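Here is a small numpy sketch of this generative story; the mixture weights and component means/covariances below are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture weights phi (phi_j >= 0, sum_j phi_j = 1) and per-component Gaussians
phi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])

n = 1000
z = rng.choice(len(phi), size=n, p=phi)  # z^(i) ~ MultiNom(phi): which Gaussian generated x^(i)
x = np.stack([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```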
matrix calculus
Transpose Rules
- \(\qty(AB)^{T} = B^{T}A^{T}\)
- \(\qty(a^{T}Bc)^{T} = c^{T} B^{T}a\)
- \(a^{T}b = b^{T}a\)
- \(\qty(A+B)C = AC + BC\)
- \(\qty(a+b)^{T}C = a^{T}C + b^{T}C\)
- \(AB \neq BA\) in general
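These rules are easy to sanity-check numerically; here is a quick sketch with random square matrices (sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))
a, b, c = (rng.normal(size=3) for _ in range(3))

assert np.allclose((A @ B).T, B.T @ A.T)        # (AB)^T = B^T A^T
assert np.allclose(a @ B @ c, c @ B.T @ a)      # (a^T B c)^T = c^T B^T a (a scalar equals its transpose)
assert np.allclose(a @ b, b @ a)                # a^T b = b^T a
assert np.allclose((A + B) @ C, A @ C + B @ C)  # (A + B)C = AC + BC
assert np.allclose((a + b) @ C, a @ C + b @ C)  # (a + b)^T C = a^T C + b^T C
assert not np.allclose(A @ B, B @ A)            # AB != BA in general
```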
Derivative
| Scalar derivative | Vector derivative |
|---|---|
| \(f\qty(x) \to \pdv{f}{x}\) | \(f\qty(x) \to \pdv{f}{x}\) |
| \(bx \to b\) | \(x^{T}B \to B\) |
| \(bx \to b\) | \(x^{T}b \to b\) |
| \(x^{2} \to 2x\) | \(x^{T}x \to 2x\) |
| \(bx^{2} \to 2bx\) | \(x^{T}Bx \to \qty(B + B^{T})x\) (\(= 2Bx\) for symmetric \(B\)) |
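Here is a finite-difference sketch checking the vector-derivative column above, including the symmetric-\(B\) caveat on the last row; the dimension, seed, and helper function are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
B = rng.normal(size=(d, d))  # not symmetric in general
eps = 1e-6

def num_grad(f, x):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)               # d/dx (x^T x) = 2x
assert np.allclose(num_grad(lambda v: v @ B @ v, x), (B + B.T) @ x)   # d/dx (x^T B x) = (B + B^T) x
Bs = (B + B.T) / 2                                                    # symmetrize B
assert np.allclose(num_grad(lambda v: v @ Bs @ v, x), 2 * Bs @ x)     # reduces to 2Bx when B is symmetric
```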
Products
\begin{equation} \pdv{AB}{A} = B^{T}, \pdv{AB}{B} = A^{T} \end{equation}
\begin{equation} \pdv{Ax}{A} = x^{T}, \pdv{Ax}{x}= A \end{equation}
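The second identity, \(\pdv{Ax}{x} = A\), is just the Jacobian of the linear map \(x \mapsto Ax\); a quick numerical check (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
eps = 1e-6

# Numerical Jacobian of f(x) = A x: column i is (f(x + eps e_i) - f(x - eps e_i)) / (2 eps)
J = np.column_stack([
    (A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(J, A)  # d(Ax)/dx = A
```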
Normal Equation
constituents
Let’s also stack all of the training examples as rows of a design matrix, and the targets as a vector:
\begin{equation} X = \mqty( \qty(x^{(1)})^{T} \\ \vdots \\ \qty(x^{(n)})^{T} ) \end{equation}
\begin{equation} Y = \mqty( y^{(1)} \\ \vdots \\ y^{(n)} ) \end{equation}
requirements
The least-squares error becomes:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n} \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)})^{2} = \frac{1}{2} \qty(X \theta - Y)^{T} \qty(X \theta - Y) \end{equation}
We can solve this exactly by taking the derivative of \(J\) with respect to \(\theta\) and setting it to \(0\) (i.e. solving for a minimum).
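Carrying out that step (the standard closed-form result, assuming \(X^{T}X\) is invertible):

\begin{equation} \nabla_{\theta} J\qty(\theta) = X^{T}\qty(X\theta - Y) = 0 \implies \theta = \qty(X^{T}X)^{-1} X^{T} Y \end{equation}

which is the normal equation.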
