
backpropagation

Last edited: October 10, 2025

backpropagation is a special case of “backwards differentiation” used to compute gradients over a computation graph and update its parameters.

constituents

  • chain rule: suppose \(J=J\qty(g_{1}, \dots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1} \dots \theta_{p})\), then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (a worked instance is sketched after this list)
  • a neural network
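
For concreteness, here is a small worked instance of the chain rule above (an illustrative example, not part of the original note): take \(J = g_{1} g_{2}\) with \(g_{1} = \theta_{1} + \theta_{2}\) and \(g_{2} = \theta_{1}\theta_{2}\). Then:

\begin{equation} \pdv{J}{\theta_{1}} = \pdv{J}{g_{1}}\pdv{g_{1}}{\theta_{1}} + \pdv{J}{g_{2}}\pdv{g_{2}}{\theta_{1}} = g_{2} \cdot 1 + g_{1} \cdot \theta_{2} = \theta_{1}\theta_{2} + \qty(\theta_{1} + \theta_{2})\theta_{2} \end{equation}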

requirements

Consider the notation in the following two-layer NN:

\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}

\begin{equation} a = \text{ReLU}\qty(z) \end{equation}

\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}

\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
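
As a concrete reading of this notation, here is a minimal numpy sketch of the forward pass for this two-layer network (the dimensions, variable names, and random initialization are assumptions for illustration, not from the note):

  import numpy as np

  # hypothetical dimensions: input dim 3, hidden dim 4, scalar output
  rng = np.random.default_rng(0)
  x = rng.normal(size=(3, 1))
  y = np.array([[1.0]])
  W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # w^(1), b^(1)
  W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # w^(2), b^(2)

  z = W1 @ x + b1                     # z = w^(1) x + b^(1)
  a = np.maximum(z, 0.0)              # a = ReLU(z)
  h = W2 @ a + b2                     # h_theta(x) = w^(2) a + b^(2)
  J = 0.5 * ((y - h) ** 2).item()     # J = 1/2 (y - h_theta(x))^2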


  1. in a forward pass, compute the value of each intermediate quantity \(z^{(1)}, a^{(1)}, \ldots\)
  2. in a backward pass, compute… (writing \(f\) for the final layer)
    1. \(\pdv{J}{z^{(f)}}\): by hand
    2. \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
    3. \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
    4. \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
    5. \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
    6. and so on… until we get to the first layer
  3. after obtaining all of these, compute the gradients with respect to the weight matrices:
    1. \(\pdv{J}{W^{(f)}}\): lemma 1 below
    2. \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
    3. …, until we get to the first layer (a numpy sketch of this backward pass follows this list)
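
A minimal numpy sketch of this backward pass for the two-layer network above, continuing the hypothetical variables from the forward-pass sketch (the lemma numbers refer to the chain rule lemmas in the next section):

  # backward pass, from the last layer to the first
  dh = h - y               # dJ/dh, by hand, since J = 1/2 (y - h)^2
  dW2 = dh @ a.T           # dJ/dW^(2)  (lemma 1)
  db2 = dh                 # dJ/db^(2)
  da = W2.T @ dh           # dJ/da      (lemma 3)
  dz = da * (z > 0)        # dJ/dz      (lemma 2: elementwise ReLU derivative)
  dW1 = dz @ x.T           # dJ/dW^(1)  (lemma 1)
  db1 = dz                 # dJ/db^(1)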

chain rule lemmas

Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
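
A sketch of the standard forms these lemmas take for fully-connected layers \(z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}\), \(a^{(l)} = \text{ReLU}\qty(z^{(l)})\) (these statements are a reconstruction matching the numbering used above, not copied from the note):

Lemma 1 (weight gradient):

\begin{equation} \pdv{J}{W^{(l)}} = \pdv{J}{z^{(l)}} \qty(a^{(l-1)})^{T} \end{equation}

Lemma 2 (through the activation):

\begin{equation} \pdv{J}{z^{(l)}} = \pdv{J}{a^{(l)}} \odot \text{ReLU}'\qty(z^{(l)}) \end{equation}

Lemma 3 (to the previous layer's activation):

\begin{equation} \pdv{J}{a^{(l-1)}} = \qty(W^{(l)})^{T} \pdv{J}{z^{(l)}} \end{equation}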

deep learning

Last edited: October 10, 2025

supervised learning with non-linear models.

Motivation

Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). With deep learning we can have non-linearity in both \(\theta\) and \(x\).

constituents

  • We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
  • Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
  • Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
  • Optimization: \(\min_{\theta} J\qty(\theta)\)
  • Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
  • Hyperparameters:
    • Learning rate: \(\alpha\)
    • Batch size \(B\)
    • Iterations: \(n_{\text{iter}}\)
  • stochastic gradient descent (where we randomly sample a single data point per update, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute the gradient over a batch); a minimal sketch of such a training loop follows this list
  • neural network
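
To make these constituents concrete, here is a minimal numpy sketch of a mini-batch gradient descent loop on a squared loss (the plain linear hypothesis, dimensions, and hyperparameter values are assumptions for illustration only):

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 200, 5
  X = rng.normal(size=(n, d))                              # inputs x^(i)
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)    # targets y^(i)

  theta = np.zeros(d)   # parameters
  alpha = 0.05          # learning rate
  B = 32                # batch size
  n_iter = 500          # iterations

  for _ in range(n_iter):
      idx = rng.choice(n, size=B, replace=False)   # sample a mini-batch
      residual = X[idx] @ theta - y[idx]           # h_theta(x^(i)) - y^(i)
      grad = (2.0 / B) * X[idx].T @ residual       # gradient of the mean squared loss on the batch
      theta -= alpha * grad                        # theta = theta - alpha * grad_theta J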

requirements

additional information

Background

Notation:

Gaussian mixture model

Last edited: October 10, 2025

Gaussian mixture model is a density estimation technique, which is useful for detecting out-of-distribution samples, etc.

We will model the dataset as a superposition of a group of Gaussian distributions that together explain the data.

Suppose the data was generated from a mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent variable \(z^{(i)}\) which tells you which Gaussian your data point was generated from.

So, for \(k\) Gaussians in your mixture:

\(z^{(i)} \in \qty{1, \dots, k}\) such that \(z^{(i)} \sim \text{Multinomial}\qty(\phi)\) (where \(\phi_{j} \geq 0\) and \(\sum_{j} \phi_{j} = 1\))
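
In the standard mixture-of-Gaussians setup, each data point is then drawn from the Gaussian selected by its latent variable (writing \(\mu_{j}, \Sigma_{j}\) for the per-component mean and covariance, which are not defined above):

\begin{equation} x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}\qty(\mu_{j}, \Sigma_{j}) \end{equation}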

matrix calculus

Last edited: October 10, 2025

Transpose Rules

  • \(\qty(AB)^{T} = B^{T}A^{T}\)
  • \(\qty(a^{T}Bc)^{T} = c^{T} B^{T}a\)
  • \(a^{T}b = b^{T}a\)
  • \(\qty(A+B)C = AC + BC\)
  • \(\qty(a+b)^{T}C = a^{T}C + b^{T}C\)
  • \(AB \neq BA\) in general

Derivative

| Scalar derivative | Vector derivative |
| \(f\qty(x) \to \pdv{f}{x}\) | \(f\qty(x) \to \pdv{f}{x}\) |
| \(bx \to b\) | \(x^{T}B \to B\) |
| \(bx \to b\) | \(x^{T}b \to b\) |
| \(x^{2} \to 2x\) | \(x^{T}x \to 2x\) |
| \(bx^{2} \to 2bx\) | \(x^{T}Bx \to 2Bx\) (for symmetric \(B\)) |
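
As a quick sanity check of the last table row, here is a small numpy sketch comparing the claimed gradient \(2Bx\) against finite differences for a symmetric \(B\) (an illustrative check, not part of the original note):

  import numpy as np

  rng = np.random.default_rng(0)
  d = 4
  A = rng.normal(size=(d, d))
  B = (A + A.T) / 2          # symmetric B, so the gradient is exactly 2Bx
  x = rng.normal(size=d)

  f = lambda v: v @ B @ v    # f(x) = x^T B x
  eps = 1e-6
  numeric = np.array([(f(x + eps * np.eye(d)[i]) - f(x - eps * np.eye(d)[i])) / (2 * eps)
                      for i in range(d)])
  print(np.allclose(numeric, 2 * B @ x, atol=1e-5))   # expect True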

Products

\begin{equation} \pdv{AB}{A} = B^{T}, \pdv{AB}{B} = A^{T} \end{equation}

\begin{equation} \pdv{Ax}{A} = x^{T}, \pdv{Ax}{x}= A \end{equation}
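
One way to read these shorthand product rules is as the factor the upstream gradient gets multiplied by during backpropagation. Here is a small numpy check of that reading for \(\pdv{AB}{A} = B^{T}\) (this interpretation and the check are my own illustration, not from the note): for a scalar \(J\) with upstream gradient \(G = \pdv{J}{\qty(AB)}\), we expect \(\pdv{J}{A} = G B^{T}\).

  import numpy as np

  rng = np.random.default_rng(1)
  A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
  G = rng.normal(size=(3, 2))                  # an arbitrary upstream gradient dJ/d(AB)

  loss = lambda A_: np.sum(G * (A_ @ B))       # scalar J whose dJ/d(AB) equals G

  eps = 1e-6
  numeric = np.zeros_like(A)
  for i in range(A.shape[0]):
      for j in range(A.shape[1]):
          E = np.zeros_like(A)
          E[i, j] = eps
          numeric[i, j] = (loss(A + E) - loss(A - E)) / (2 * eps)

  print(np.allclose(numeric, G @ B.T, atol=1e-5))   # dJ/dA = G B^T; expect True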

Normal Equation

Last edited: October 10, 2025

constituents

Let’s also collect all of our training examples, stacking the inputs as rows and the targets as a column:

\begin{equation} X = \mqty( \qty(x^{(1)})^{T} \\ \vdots \\ \qty(x^{(n)})^{T} ) \end{equation}

\begin{equation} Y = \mqty(y^{(1)} \\ \dots \\ y^{(n)}) \end{equation}

requirements

least-squares error becomes:

\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n} \qty(h\qty(x^{(i)}) - y^{(i)})^{2} = \frac{1}{2}\qty(X \theta - Y)^{T} \qty(X \theta - Y) \end{equation}

We solve this exactly by taking the derivative of \(J\) with respect to \(\theta\) and setting it to \(0\) (i.e. the condition for a minimum), which yields the normal equation sketched below.
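
Carrying out that derivative with the matrix calculus rules above (a standard derivation, assuming \(X^{T}X\) is invertible):

\begin{equation} \nabla_{\theta} J\qty(\theta) = X^{T}\qty(X\theta - Y) = 0 \end{equation}

\begin{equation} \implies X^{T}X\theta = X^{T}Y \implies \theta = \qty(X^{T}X)^{-1} X^{T} Y \end{equation}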