Algorithms Index

Last edited: October 10, 2025

SU-CS161 Things to Review

Lectures

Divide and Conquer

Sorting

Data Structures

Graphs

DP

Greedy Algorithms

Closing

backpropagation

Last edited: October 10, 2025

backpropagation is a special case of “backwards differentiation” used to update a computation graph.

constituents

  • chain rule: suppose \(J=J\qty(g_{1} \dots g_{k}), g_{i} = g_{i}\qty(\theta_{1} \dots \theta_{p})\), then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (see the small worked example after this list)
  • a neural network
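
For instance, a small made-up example (not from the notes) to see the sum at work: take \(J = g_{1} g_{2}\) with \(g_{1} = \theta_{1} + \theta_{2}\) and \(g_{2} = \theta_{1} \theta_{2}\); then

\begin{equation} \pdv{J}{\theta_{1}} = \pdv{J}{g_{1}} \pdv{g_{1}}{\theta_{1}} + \pdv{J}{g_{2}} \pdv{g_{2}}{\theta_{1}} = g_{2} \cdot 1 + g_{1} \cdot \theta_{2} \end{equation}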

requirements

Consider the notation in the following two-layer NN:

\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}

\begin{equation} a = \text{ReLU}\qty(z) \end{equation}

\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}

\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}


  1. in a forward pass, compute and store each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
  2. in a backward pass, compute…
    1. \(\pdv{J}{z^{(f)}}\): by hand
    2. \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
    3. \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
    4. \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
    5. \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
    6. and so on… until we get to the first layer
  3. after obtaining all of these, we compute the gradients with respect to the weight matrices:
    1. \(\pdv{J}{W^{(f)}}\): lemma 1 below
    2. \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
    3. …, until we get to the first layer (see the sketch below)
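
Below is a minimal numpy sketch of this procedure on the two-layer network above (not from the notes; the sizes, inputs, and random weights are made up for illustration). The forward pass caches every intermediate value, and the backward pass applies the chain rule in exactly the order of the steps listed, with comments marking which lemma each line corresponds to.

import numpy as np

rng = np.random.default_rng(0)

# made-up shapes and values, just to trace the computation
x = rng.normal(size=(3, 1))                 # input
y = 1.0                                     # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# forward pass: compute and store every intermediate value
z = W1 @ x + b1
a = np.maximum(z, 0)                        # ReLU
h = W2 @ a + b2                             # h_theta(x)
J = 0.5 * (y - h) ** 2

# backward pass: dJ/dh by hand, then chain-rule back through the graph
dJ_dh = h - y                               # derivative of (1/2)(y - h)^2
dJ_dW2 = dJ_dh @ a.T                        # gradient w.r.t. the last weight matrix ("lemma 1")
dJ_db2 = dJ_dh
dJ_da = W2.T @ dJ_dh                        # push gradient to the previous activation ("lemma 3")
dJ_dz = dJ_da * (z > 0)                     # multiply elementwise by ReLU'(z) ("lemma 2")
dJ_dW1 = dJ_dz @ x.T                        # gradient w.r.t. the first weight matrix ("lemma 1")
dJ_db1 = dJ_dz
print(J.item(), dJ_dW1.shape, dJ_dW2.shape)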

chain rule lemmas

Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
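
The lemma statements themselves are not reproduced in this note; as a reconstruction (matching how they are invoked in the backward-pass steps above), the usual forms are roughly as follows, with \(\sigma\) the layer's activation and \(\odot\) elementwise multiplication:

\begin{equation} \text{(lemma 1)} \quad \pdv{J}{W^{(l)}} = \pdv{J}{z^{(l)}} \qty(a^{(l-1)})^{\top}, \qquad \pdv{J}{b^{(l)}} = \pdv{J}{z^{(l)}} \end{equation}

\begin{equation} \text{(lemma 2)} \quad \pdv{J}{z^{(l)}} = \pdv{J}{a^{(l)}} \odot \sigma'\qty(z^{(l)}) \end{equation}

\begin{equation} \text{(lemma 3)} \quad \pdv{J}{a^{(l-1)}} = \qty(W^{(l)})^{\top} \pdv{J}{z^{(l)}} \end{equation}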

deep learning

Last edited: October 10, 2025

supervised learning with non-linear models.

Motivation

Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). Today: with deep learning we can have non-linearity in both \(\theta\) and \(x\).

constituents

  • We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
  • Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
  • Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
  • Optimization: \(\min_{\theta} J\qty(\theta)\)
  • Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
  • Hyperparameters:
    • Learning rate: \(\alpha\)
    • Batch size \(B\)
    • Iterations: \(n_{\text{iter}}\)
  • stochastic gradient descent (where we randomly sample a single dataset point per update, etc.) or batch gradient descent (where we compute the gradient over a batch and scale the learning rate by the batch size); see the sketch after this list
  • neural network
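
Here is a minimal mini-batch SGD sketch of the pieces above (not from the notes); to keep it self-contained it uses a plain linear hypothesis \(h_{\theta}\qty(x) = \theta^{\top} x\) on made-up data, but the update \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\) is the same regardless of the model.

import numpy as np

rng = np.random.default_rng(0)

# made-up dataset {(x^(i), y^(i))} and hyperparameters, just to show the update
n, p = 100, 3
X = rng.normal(size=(n, p))                       # x^(i) as rows
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
alpha, B, n_iter = 0.1, 16, 200                   # learning rate, batch size, iterations

theta = np.zeros(p)
for _ in range(n_iter):
    idx = rng.choice(n, size=B, replace=False)    # sample a mini-batch
    pred = X[idx] @ theta                         # h_theta(x) on the batch
    grad = -2.0 / B * X[idx].T @ (y[idx] - pred)  # gradient of (1/B) sum (y - h)^2
    theta = theta - alpha * grad                  # gradient step
print(theta)                                      # approaches [1, -2, 0.5]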

requirements

additional information

Background

Notation:

Dijkstra's Algorithm

Last edited: October 10, 2025

constituents

  • a weighted directed graph, meaning edges have weights
  • where cost of a path is the sum of the weights along that path
  • the shortest path is the path with the minimum cost
  • starting node \(s\), target node \(t\)

Each node has two states:

  • unlabeled
  • labeled

And stores a piece of information:

\(d\qty(v) = \text{distance}\qty(s, v)\)

We initialize \(d\qty(v) = \infty, v \neq s\) , and \(d\qty(s) = 0\).

requirements

  1. pick the unlabeled node \(u\) with the smallest \(d\qty(u)\)
  2. for each unlabeled neighbor \(v\) of \(u\):
    1. set \(d\qty(v) = \min \qty(d\qty(v), d\qty(u) + \text{edgeWeight}\qty(u,v))\)
  3. mark \(u\) as labeled
def dijkstra(G, s, t):
    # tentative shortest-path distances from s
    d = {}
    for v in G.V:
        v.state = State.NOTSURE
        d[v] = float("+inf")
    d[s] = 0
    # repeatedly finalize ("label") the closest not-sure vertex
    while any(v.state == State.NOTSURE for v in G.V):
        u = min((v for v in G.V if v.state == State.NOTSURE), key=lambda v: d[v])
        for v in u.neighbors:
            if v.state != State.SURE:
                d[v] = min(d[v], d[u] + edgeWeight(u, v))
        u.state = State.SURE
    return d[t]
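
For something directly runnable, here is a self-contained sketch of the same idea (not from the notes) on a plain adjacency-dict graph with made-up nodes and weights; it swaps the linear scan for a binary heap, the standard refinement that picks the closest unlabeled node in \(O\qty(\log n)\) time.

import heapq

def dijkstra_heap(graph, s, t):
    # graph: dict mapping node -> list of (neighbor, weight) pairs
    d = {v: float("inf") for v in graph}
    d[s] = 0
    labeled = set()
    heap = [(0, s)]                       # (tentative distance, node)
    while heap:
        dist_u, u = heapq.heappop(heap)
        if u in labeled:
            continue                      # skip stale heap entries
        labeled.add(u)
        for v, w in graph[u]:
            if dist_u + w < d[v]:
                d[v] = dist_u + w
                heapq.heappush(heap, (d[v], v))
    return d[t]

# tiny made-up example graph
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("t", 6)],
    "b": [("t", 1)],
    "t": [],
}
print(dijkstra_heap(graph, "s", "t"))     # 4, via s -> a -> b -> t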

additional information

proof

Let’s start with a lemma

Gaussian mixture model

Last edited: October 10, 2025

Gaussian mixture model is a density estimation technique, which is useful for detecting out-of-distribution samples, etc.

We will use a superposition of a group of Gaussian distributions to explain the dataset.

Suppose the data was generated from a mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent \(z^{(i)}\) which tells you which Gaussian your data point was generated from.

So, for \(k\) Gaussians in your mixture:

\(z^{(i)} \in \qty {1, \dots, k}\) such that \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\) (such that \(\phi_{j} \geq 0\), \(\sum_{j}^{} \phi_{j} = 1\))
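
A minimal numpy sketch of this generative story (the mixture weights, means, and covariances below are made up for illustration): sample the latent component \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\), then sample \(x^{(i)}\) from the corresponding Gaussian.

import numpy as np

rng = np.random.default_rng(0)

# made-up mixture parameters (k = 2 Gaussians in 2-D)
phi = np.array([0.3, 0.7])                       # mixing weights, phi_j >= 0, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])         # one mean per Gaussian
sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])  # one covariance per Gaussian

def sample_gmm(n):
    # z^(i) ~ MultiNom(phi), then x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)
    z = rng.choice(len(phi), size=n, p=phi)
    x = np.array([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
    return x, z

x, z = sample_gmm(500)
print(x.shape, np.bincount(z) / len(z))          # component frequencies roughly match phi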