Algorithms Index
Last edited: October 10, 2025
Lectures
Divide and Conquer
Sorting
- merge sort: SU-CS161 SEP252025
- recurrence solving: SU-CS161 SEP302025
- median: SU-CS161 OCT022025
- randomized algos + quicksort: SU-CS161 OCT072025
- linear time sorting: SU-CS161 OCT092025
Data Structures
- red-black trees: SU-CS161 OCT142025
- hashing: SU-CS161 OCT212025
Graphs
- DFS/BFS: SU-CS161 OCT232025
- Strongly connected components: SU-CS161 OCT282025
- Dijkstra and Bellman-Ford: SU-CS161 OCT302025
DP
Greedy Algorithms
Closing
backpropagation
Last edited: October 10, 2025
backpropagation is a special case of “backwards differentiation” used to compute the gradients needed to update a computation graph.
constituents
- chain rule: suppose \(J=J\qty(g_{1}, \dots, g_{k})\) and \(g_{i} = g_{i}\qty(\theta_{1} \dots \theta_{p})\); then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (a short worked example follows this list)
- a neural network
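As a quick check of the chain rule above, here is a small worked example with \(k = 2\) and a single parameter \(\theta\) (the functions are hypothetical, chosen only for illustration):
\begin{equation} J = g_{1} g_{2}, \qquad g_{1} = \theta^{2}, \qquad g_{2} = \sin \theta \end{equation}
\begin{equation} \pdv{J}{\theta} = \pdv{J}{g_{1}} \pdv{g_{1}}{\theta} + \pdv{J}{g_{2}} \pdv{g_{2}}{\theta} = g_{2} \cdot 2\theta + g_{1} \cdot \cos \theta = 2\theta \sin \theta + \theta^{2} \cos \theta \end{equation}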
requirements
Consider the notation in the following two-layer NN:
\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}
\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
- in a forward pass, compute the value of each intermediate quantity \(z^{(1)}, a^{(1)}, \ldots\)
- in a backward pass, compute…
- \(\pdv{J}{z^{(f)}}\): by hand
- \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
- \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
- and so on… until we get to the first layer
- after obtaining all of these, we compute the gradients with respect to the weight matrices:
- \(\pdv{J}{W^{(f)}}\): lemma 1 below
- \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
- …, until we get to the first layer (a minimal code sketch of this procedure follows)
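A minimal sketch of one forward and backward pass for the two-layer network above, assuming \(x\) is a column vector, \(y\) is a scalar, and the parameters are numpy arrays (the lemma applications are written out directly rather than referenced by number):

import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # forward pass: compute and cache every intermediate value
    z = W1 @ x + b1               # z = w^(1) x + b^(1)
    a = np.maximum(z, 0)          # a = ReLU(z)
    h = W2 @ a + b2               # h_theta(x) = w^(2) a + b^(2)
    J = 0.5 * (y - h) ** 2        # J = 1/2 (y - h_theta(x))^2

    # backward pass: walk from the loss back toward the first layer
    dJ_dh = h - y                 # dJ/dh, computed by hand
    dJ_dW2 = dJ_dh @ a.T          # gradient w.r.t. the output weights
    dJ_db2 = dJ_dh
    dJ_da = W2.T @ dJ_dh          # push the gradient through w^(2)
    dJ_dz = dJ_da * (z > 0)       # ReLU passes gradient only where z > 0
    dJ_dW1 = dJ_dz @ x.T          # gradient w.r.t. the first-layer weights
    dJ_db1 = dJ_dz
    return J, (dJ_dW1, dJ_db1, dJ_dW2, dJ_db2)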
chain rule lemmas
Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
deep learning
Last edited: October 10, 2025
deep learning is supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e., we could have non-linear features of \(x\), but the model was always linear in \(\theta\)). Today: with deep learning, we can have non-linearity in both \(\theta\) and \(x\).
constituents
- We have the dataset \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\)
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
- Learning rate: \(\alpha\)
- Batch size \(B\)
- Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a dataset point per step, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute the gradient over a batch); see the sketch after this list
- neural network
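A minimal sketch of the optimization loop described above, assuming hypothetical callables h(theta, x) and grad_h(theta, x) for the model and its gradient (these names are illustrative, not from the notes); the per-example loss is the squared error \(J^{(i)}\) from above:

import numpy as np

def sgd(theta, xs, ys, h, grad_h, alpha=0.01, batch_size=1, n_iter=1000):
    # theta: parameter vector; xs, ys: the dataset {(x^(i), y^(i))}
    n = len(xs)
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        # sample a batch of indices (batch_size = 1 recovers plain SGD)
        idx = rng.choice(n, size=batch_size)
        grad = np.zeros_like(theta)
        for i in idx:
            residual = ys[i] - h(theta, xs[i])              # y^(i) - h_theta(x^(i))
            grad += -2.0 * residual * grad_h(theta, xs[i])  # gradient of J^(i)
        theta = theta - alpha * grad / batch_size           # gradient step
    return theta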
requirements
additional information
Background
Notation:
Dijkstra's Algorithm
Last edited: October 10, 2025
constituents
- a weighted directed graph, meaning edges have weights
- where the cost of a path is the sum of the weights along that path
- the shortest path is the path with the minimum cost
- starting node \(s\), target node \(t\)
Each node has two states:
- unlabeled
- labeled
And stores a piece of information:
\(d\qty(v) = \text{distance}\qty(s, v)\)
We initialize \(d\qty(v) = \infty\) for all \(v \neq s\), and \(d\qty(s) = 0\).
requirements
- pick the unlabeled node \(u\) with the smallest \(d\qty(u)\)
- for each neighbor \(v\) of \(u\) that is still unlabeled:
- set \(d\qty(v) = \min \qty(d\qty(v), d\qty(u) + \text{edgeWeight}\qty(u,v))\)
- mark \(u\) as labeled
- repeat until every node is labeled; \(d\qty(t)\) is then the shortest-path cost
def dijkstra(G, s, t):
    # initialize: every vertex starts unlabeled with distance estimate +infinity
    d = {}
    for v in G.V:
        v.state = State.NOTSURE
        d[v] = float("+inf")
    d[s] = 0
    while any(v.state == State.NOTSURE for v in G.V):
        # pick the unlabeled vertex with the smallest tentative distance
        u = min((v for v in G.V if v.state == State.NOTSURE), key=lambda v: d[v])
        # relax every edge out of u that leads to a still-unlabeled neighbor
        for v in u.neighbors:
            if v.state != State.SURE:
                d[v] = min(d[v], d[u] + edgeWeight(u, v))
        u.state = State.SURE
    return d[t]
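The routine above leans on an external State enum, a graph object G, and an edgeWeight helper; here is a self-contained sketch of the same procedure under the assumption that the graph is given as a nested dict of edge weights (all names below are illustrative):

import math

def dijkstra_dict(graph, s, t):
    # graph: {u: {v: weight, ...}, ...}; every vertex appears as a key
    d = {v: math.inf for v in graph}   # tentative distances
    d[s] = 0
    not_sure = set(graph)              # vertices whose distance is not yet final
    while not_sure:
        # pick the unlabeled vertex with the smallest tentative distance
        u = min(not_sure, key=lambda v: d[v])
        not_sure.remove(u)             # label u: its distance is now final
        for v, w in graph[u].items():
            if v in not_sure:
                d[v] = min(d[v], d[u] + w)  # relax edge (u, v)
    return d[t]

# example: shortest path from "s" to "t" in a small weighted digraph
g = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "t": 6}, "b": {"t": 1}, "t": {}}
print(dijkstra_dict(g, "s", "t"))  # 4  (s -> a -> b -> t)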
additional information
proof
Let’s start with a lemma
Gaussian mixture model
Last edited: October 10, 2025
Gaussian mixture model is a density estimation technique, which is useful for detecting out-of-distribution samples, etc.
We model the data as a superposition of a group of Gaussian distributions that together explain the dataset.
Suppose the data was generated from a Mixture of Gaussians; then for every data point \(x^{(i)}\) there is a latent \(z^{(i)}\) which tells you which Gaussian your data point was generated from.
So, for \(k\) Gaussians in your mixture:
\(z^{(i)} \in \qty {1, \dots, k}\) such that \(z^{(i)} \sim \text{MultiNom}\qty(\phi)\) (such that \(\phi_{j} \geq 0\), \(\sum_{j}^{} \phi_{j} = 1\))
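A minimal sketch of this generative story, assuming hypothetical one-dimensional mixture parameters phi, mus, and sigmas (none of these names come from the notes):

import numpy as np

def sample_gmm(n, phi, mus, sigmas, seed=0):
    # phi: mixing proportions (phi_j >= 0, sum to 1); mus/sigmas: per-component parameters
    rng = np.random.default_rng(seed)
    # latent z^(i) ~ Multinomial(phi): which Gaussian generated each point
    z = rng.choice(len(phi), size=n, p=phi)
    # x^(i) | z^(i) = j  ~  N(mu_j, sigma_j^2)
    x = rng.normal(loc=np.asarray(mus)[z], scale=np.asarray(sigmas)[z])
    return x, z

# e.g. draw 1000 points from a 2-component mixture
x, z = sample_gmm(1000, phi=[0.3, 0.7], mus=[-2.0, 3.0], sigmas=[0.5, 1.0])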
