
deep learning

Last edited: October 10, 2025

supervised learning with non-linear models.

Motivation

Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). Today: with deep learning, the hypothesis can be non-linear in both \(\theta\) and \(x\).

constituents

  • We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
  • Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
  • Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
  • Optimization: \(\min_{\theta} J\qty(\theta)\)
  • Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
  • Hyperparameters:
    • Learning rate: \(\alpha\)
    • Batch size \(B\)
    • Iterations: \(n_{\text{iter}}\)
  • stochastic gradient descent (where we randomly sample a dataset point per update, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute the gradient over a batch); see the sketch after this list
  • neural network
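
A minimal sketch of the pieces above, assuming a linear hypothesis \(h_{\theta}\qty(x) = \theta^{T}x\) purely for illustration; the synthetic data and variable names are hypothetical:

#+begin_src python
import numpy as np

# Hypothetical synthetic dataset (x^(i), y^(i)), i = 1..n
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Hyperparameters: learning rate alpha, batch size B, iteration count n_iter
alpha, B, n_iter = 0.1, 16, 500

theta = np.zeros(d)
for _ in range(n_iter):
    # Sample B indices and average the gradients of J^(i)(theta) over the batch
    idx = rng.choice(n, size=B, replace=False)
    residual = X[idx] @ theta - y[idx]        # h_theta(x^(i)) - y^(i)
    grad = 2.0 / B * X[idx].T @ residual      # gradient of the batch cost
    theta = theta - alpha * grad              # theta <- theta - alpha * grad J(theta)

print(theta)  # should approach [1.0, -2.0, 0.5]
#+end_src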

requirements

additional information

Background

Notation:

Gaussian mixture model

Last edited: October 10, 2025

Gaussian mixture model is a density estimation technique, which is useful for detecting out-of-distribution samples, etc.

We will use a superposition of a group of Gaussian distributions to explain the dataset.

Suppose the data was generated from a mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent variable \(z^{(i)}\) which tells you which Gaussian your data point was generated from.

So, for \(k\) Gaussians in your mixture:

\(z^{(i)} \in \qty {1, \dots, k}\) such that \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\) (such that \(\phi_{j} \geq 0\), \(\sum_{j}^{} \phi_{j} = 1\))
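
A minimal sketch of this generative story; the mixture parameters \(\phi\), \(\mu_{j}\), \(\Sigma_{j}\) below are hypothetical, chosen only to illustrate sampling \(z^{(i)}\) and then \(x^{(i)}\):

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture with k = 2 Gaussians
phi = np.array([0.3, 0.7])                      # phi_j >= 0, sum_j phi_j = 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])         # means mu_j
sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # covariances Sigma_j

n = 5
# Latent assignment z^(i) ~ Multinomial(phi), values in {0, ..., k-1}
z = rng.choice(len(phi), size=n, p=phi)
# Each x^(i) is drawn from the Gaussian selected by z^(i)
x = np.array([rng.multivariate_normal(mu[j], sigma[j]) for j in z])

print(z)
print(x)
#+end_src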

matrix calculus

Last edited: October 10, 2025

Transpose Rules

  • \(\qty(AB)^{T} = B^{T}A^{T}\)
  • \(\qty(a^{T}Bc)^{T} = c^{T} B^{T}a\)
  • \(a^{T}b = b^{T}a\)
  • \(\qty(A+B)C = AC + BC\)
  • \(\qty(a+b)^{T}C = a^{T}C + b^{T}C\)
  • \(AB \neq BA\) (in general)

Derivative

| Scalar derivative           | Vector derivative                |
|-----------------------------+----------------------------------|
| \(f\qty(x) \to \pdv{f}{x}\) | \(f\qty(x) \to \pdv{f}{x}\)      |
| \(bx \to b\)                | \(x^{T}B \to B\)                 |
| \(bx \to b\)                | \(x^{T}b \to b\)                 |
| \(x^{2} \to 2x\)            | \(x^{T}x \to 2x\)                |
| \(bx^{2} \to 2bx\)          | \(x^{T}Bx \to 2Bx\)              |
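
A quick numerical check of the last row (keeping in mind that \(\pdv{}{x} x^{T}Bx = \qty(B + B^{T})x\) in general, which reduces to \(2Bx\) when \(B\) is symmetric); the values below are arbitrary illustrative choices:

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.normal(size=(d, d))
B = B + B.T                      # make B symmetric so the 2Bx form applies
x = rng.normal(size=d)

f = lambda v: v @ B @ v          # f(x) = x^T B x

# Central finite-difference gradient vs. the analytic 2 B x
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
analytic = 2 * B @ x

print(np.allclose(numeric, analytic, atol=1e-5))  # True
#+end_src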

Products

\begin{equation} \pdv{AB}{A} = B^{T}, \pdv{AB}{B} = A^{T} \end{equation}

\begin{equation} \pdv{Ax}{A} = x^{T}, \pdv{Ax}{x}= A \end{equation}
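
Similarly, a sketch checking \(\pdv{Ax}{x} = A\) by finite differences (arbitrary illustrative values):

#+begin_src python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

# Column j of the Jacobian of A x is (A(x + eps e_j) - A(x - eps e_j)) / (2 eps)
eps = 1e-6
jac = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps)
                       for e in np.eye(4)])

print(np.allclose(jac, A))  # True: d(Ax)/dx = A
#+end_src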

Normal Equation

Last edited: October 10, 2025

constituents

Let’s also define our entire training examples and stack them in rows:

\begin{equation} X = \mqty( - {x^{(1)}}^{T} - \\ \dots \\ - {x^{(n)}}^{T} - ) \end{equation}

\begin{equation} Y = \mqty(y^{(1)} \\ \dots \\ y^{(n)}) \end{equation}

requirements

least-squares error becomes:

\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n} \qty(h\qty(x^{(i)}) - y^{(i)}) ^{2} = \frac{1}{2} \qty(X \theta - Y)^{T} \qty(X \theta - Y) \end{equation}

Solving this exactly, we take the derivative of \(J\) and set it to \(0\) (i.e., at a minimum); we obtain the normal equation:

\begin{equation} X^{T}X \theta = X^{T}Y \implies \theta = \qty(X^{T}X)^{-1} X^{T}Y \end{equation}
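
A minimal sketch of solving this numerically on hypothetical data, solving the system \(X^{T}X\theta = X^{T}Y\) rather than explicitly inverting \(X^{T}X\):

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                              # rows are the x^(i) transposed
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=n)

# Normal equation: X^T X theta = X^T Y
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, more numerically robust route
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta, theta_lstsq)  # both close to [2.0, -1.0, 0.5]
#+end_src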

SU-CS161 OCT302025

Last edited: October 10, 2025

Key Sequence

Notation

New Concepts

Important Results / Claims

Questions

Interesting Factoids