Posts

MOEReview Sharma: LAZER

Last edited: December 12, 2025

One-Liner

Getting rid of the low-singular-value components of weight matrices actually improves model performance.

Motivation

Previous work has shown that pruning SVD components works without significant performance degradation. But this work shows that by choosing where to prune more carefully, we can obtain better-than-baseline performance.

Notable Methods

We do this by trying all reductions indexed by \(\qty(\tau, \ell, \rho)\) tuples, where \(\tau\) is the parameter type (q, k, v projections, attention output, MLP in and out), \(\ell\) is the layer number, and \(\rho\) is the rate of reduction.
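
A minimal sketch of what one such reduction looks like, assuming PyTorch; the helper names in the commented search loop (`get_weight`, `set_weight`, `evaluate`) are hypothetical, not the authors' code.

```python
import torch

def rank_reduce(W: torch.Tensor, rho: float) -> torch.Tensor:
    """Low-rank approximation of W that drops the smallest singular-value
    components at rate rho (e.g. rho = 0.9 keeps only the top 10%)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    k = max(1, int(round((1.0 - rho) * S.numel())))   # components to keep
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Hypothetical search over (tau, ell, rho) tuples:
# for ell in range(n_layers):
#     for tau in ["q_proj", "k_proj", "v_proj", "o_proj", "mlp_in", "mlp_out"]:
#         for rho in [0.5, 0.9, 0.99]:
#             W = get_weight(model, ell, tau)                 # assumed helper
#             set_weight(model, ell, tau, rank_reduce(W, rho))
#             score = evaluate(model)                         # keep the best tuple
```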

MOEReview Shen: ModuleFormer

Last edited: December 12, 2025

The good ol’ load-balancing loss. Instead of training a router with explicitly labeled data for each expert, a load-balancing + load-concentration loss induces modularity from the data.

Insight: we want to maximize the mutual information between tokens and modules. For the router \(m \sim g\qty(\cdot \mid x)\) (“which module \(m\) should we assign, given token \(x\)”), we write:

\begin{equation} \ell_{MI} = \underbrace{\sum_{m=1}^{N} p\qty(m) \log p\qty(m)}_{-H\qty(m)} - \frac{1}{|X|} \sum_{x \in X} \underbrace{\sum_{m=1}^{N} g\qty(m|x) \log g\qty(m|x)}_{-H\qty(m|x)} \end{equation}
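
so minimizing \(\ell_{MI} = -H\qty(m) + \frac{1}{|X|}\sum_{x} H\qty(m|x)\) maximizes the mutual information. A minimal sketch of this loss, assuming `router_probs` holds \(g\qty(m|x)\) for a batch of tokens (shape \(|X| \times N\)); not the authors' code.

```python
import torch

def mi_loss(router_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # p(m): marginal module usage, averaged over the batch of tokens X
    p_m = router_probs.mean(dim=0)                                    # (N,)
    neg_H_m = (p_m * (p_m + eps).log()).sum()                         # -H(m)
    # mean_x H(m | x): average per-token routing entropy
    H_m_given_x = -(router_probs * (router_probs + eps).log()).sum(dim=1).mean()
    return neg_H_m + H_m_given_x                                      # = -I(m; x)
```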

MOEReview Sukhbaatar: Branch-Train-MiX

Last edited: December 12, 2025

It’s MOEReview Li: Branch-Train-Merge, but with MoEs now. Each layer is combined by standard MoE routing, with a routing weight that is tuned; a rough sketch follows.
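
Under my reading, the branch-trained FFNs of a layer become the experts of an MoE layer and a fresh router is the tuned weight; the class below is an illustrative sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MergedMoELayer(nn.Module):
    def __init__(self, branch_ffns: list[nn.Module], d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(branch_ffns)             # one FFN per branch
        self.router = nn.Linear(d_model, len(branch_ffns))    # tuned after merging
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); standard top-k routing over the merged experts
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```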

MOEReview Tan: Scattered MoE

Last edited: December 12, 2025

A single kernel that scatters the residuals and runs the expert forward pass at the same time, instead of copying and grouping tokens first.
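
For contrast, here is a sketch of the copy-then-group baseline that the fused kernel avoids: gather each expert's tokens into a contiguous buffer, run the expert, scatter results back. ScatterMoE's kernel does the gather, matmul, and scatter in one pass; this snippet is illustrative only, not the paper's kernel.

```python
import torch

def grouped_moe_baseline(x, expert_weights, assignments):
    """x: (tokens, d); expert_weights: (n_experts, d, d); assignments: (tokens,)"""
    out = torch.empty_like(x)
    for e in range(expert_weights.shape[0]):
        idx = (assignments == e).nonzero(as_tuple=True)[0]
        grouped = x.index_select(0, idx)            # explicit copy ("grouping")
        out[idx] = grouped @ expert_weights[e]      # expert forward
    return out
```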

MOEReview Yun: Inference-Optimal MoEs

Last edited: December 12, 2025

“the scaling law (Section 3) shows that more experts (larger E) result in a higher performance; on the other hand, more experts result in a larger inference cost (Section 4.2)”

How do we trade off the cost of more experts (in GPU-seconds or dollars, with \(C_0\) being the per-second GPU cost) against performance?
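
A minimal sketch of the kind of trade-off being asked about, under an assumed cost model (not the paper's exact formula): total dollars = training GPU-seconds plus expected lifetime inference GPU-seconds, both priced at \(C_0\).

```python
def total_cost(train_gpu_seconds: float,
               inference_gpu_seconds_per_token: float,
               tokens_served: float,
               C0: float) -> float:
    """Assumed cost model: everything priced at C0 dollars per GPU-second."""
    return C0 * (train_gpu_seconds
                 + inference_gpu_seconds_per_token * tokens_served)

# More experts E lowers loss for a given training budget but raises
# inference_gpu_seconds_per_token, so the optimum depends on tokens_served.
```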

So, slight over-training achieves better performance. Two findings:

  • fewer but bigger experts (4/8) is the most serving-efficient, but costs more to train to the same loss
  • with enough data, a big-expert-count (16/32) MoE could be smaller, and slight over-training can boost performance