SU-CS224N APR092024
Last edited: October 10, 2025
Neural networks are powerful because of the self-organization of their intermediate layers.
Neural Network Layer
\begin{equation} z = Wx + b \end{equation}
gives the layer's pre-activation output, and the activations are:
\begin{equation} a = f(z) \end{equation}
where the activation function \(f\) is applied element-wise.
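As a minimal sketch of the two equations above, the following pure-Python snippet computes \(z = Wx + b\) and then \(a = f(z)\) element-wise. The function names, the choice of a sigmoid for \(f\), and the example numbers are assumptions made for illustration only.

```python
import math

def sigmoid(v):
    """Logistic function, used here as one possible element-wise activation f."""
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(W, x, b, f=sigmoid):
    """Compute z = Wx + b, then a = f(z) applied element-wise.

    W is a list of rows; x and b are plain lists of floats.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    a = [f(z_i) for z_i in z]
    return z, a

# Hypothetical 2x2 layer, chosen so that z comes out as all zeros.
W = [[1.0, -1.0], [0.5, 0.5]]
x = [2.0, 2.0]
b = [0.0, -2.0]
z, a = layer_forward(W, x, b)
print(z, a)
```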
Why are NNs Non-Linear?
- stacking multiple linear layers adds no representational power, since a composition of linear maps is itself linear (though big linear networks can still have better learning/convergence properties!)
- most things are non-linear!
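To make the first point concrete, here is a small sketch showing that two stacked linear layers compute the same function as a single layer whose weight matrix is the product of the two. The matrices are made-up examples and biases are left out for brevity (with biases, the stack still collapses into one affine map).

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two linear layers with no activation in between.
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, 0.0], [1.0, 1.0]]
x = [1.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x
print(two_layers, one_layer)
```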
Activation Function
We want non-linear, non-threshold (0/1) activation functions because they have a slope (a hard threshold is flat almost everywhere), meaning we can perform gradient-based learning.
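A rough way to see the difference is to estimate the slope numerically. In the sketch below, the step function, the sigmoid, the evaluation point, and the finite-difference helper are all assumptions chosen for illustration.

```python
import math

def step(z):
    """Hard 0/1 threshold activation."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth activation with a nonzero slope everywhere."""
    return 1.0 / (1.0 + math.exp(-z))

def numeric_slope(f, z, eps=1e-4):
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

step_slope = numeric_slope(step, 1.0)        # no gradient signal away from 0
sigmoid_slope = numeric_slope(sigmoid, 1.0)  # usable gradient signal
print(step_slope, sigmoid_slope)
```

With a zero slope, a gradient step would leave the weights feeding the threshold unit unchanged, which is why smooth activations are preferred for gradient-based learning.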
topological sort
Last edited: October 10, 2025
A topological sort of a directed acyclic graph (DAG) is an ordering of its vertices such that if there’s an edge \(A \to B\), then \(A\) comes before \(B\) in the ordering (so no edge points from a later vertex back to an earlier one). Every directed acyclic graph has a topological sort.
solving topological sort with depth first search
In a DAG, listing the vertices from largest to smallest depth first search finish time always gives a topological sort.
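The idea can be sketched as follows: run depth first search, record each vertex when it finishes, and reverse that list. The adjacency-dict representation and the example graph are assumptions for illustration, and the sketch assumes the input is acyclic (it does no cycle detection).

```python
def topological_sort(graph):
    """Return the vertices of a DAG ordered by decreasing DFS finish time.

    `graph` maps each vertex to a list of its out-neighbors.
    Assumes the graph is acyclic; cycles are not detected.
    """
    visited = set()
    finished = []  # vertices in order of increasing finish time

    def dfs(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs(v)
        finished.append(u)  # u finishes after all its descendants

    for u in graph:
        if u not in visited:
            dfs(u)
    return finished[::-1]  # largest finish time first

# Hypothetical DAG: A -> B, A -> C, B -> D, C -> D
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = topological_sort(graph)
print(order)
```

The recursive form is fine for small graphs; very deep graphs would need an explicit stack to avoid Python's recursion limit.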
Decision Tree
Last edited: October 10, 2025
Let’s consider greedy Decision Tree learning.
greedy procedure
- initial tree (no split): always predict the majority class \(\hat{y} = \text{maj}\qty(y), \forall x\)
- for each feature \(h\qty(x)\):
  - split the data according to the feature
  - compute the classification error of the split
- choose \(h^{*}\qty(x)\) with the lowest error after splitting
- loop until a stopping criterion is met
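One round of the greedy procedure above can be sketched like this, assuming categorical features stored in dicts. The helper names, the feature names, and the toy dataset are all hypothetical, and the sketch picks a single best split rather than building a full tree.

```python
from collections import Counter

def majority_errors(labels):
    """Number of mistakes made by always predicting the majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def split_error(rows, labels, feature):
    """Classification error after splitting the data on one feature."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    return sum(majority_errors(g) for g in groups.values()) / len(labels)

def best_feature(rows, labels, features):
    """Greedy choice: the feature whose split has the lowest error."""
    return min(features, key=lambda f: split_error(rows, labels, f))

# Hypothetical toy dataset.
rows = [
    {"credit": "good", "term": "3yr"},
    {"credit": "good", "term": "5yr"},
    {"credit": "bad", "term": "3yr"},
    {"credit": "bad", "term": "5yr"},
]
labels = ["safe", "safe", "risky", "risky"]
chosen = best_feature(rows, labels, ["credit", "term"])
print(chosen)
```

Here splitting on `credit` separates the classes perfectly, while splitting on `term` leaves each branch evenly mixed, so the greedy step picks `credit`.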
stopping criteria
- all examples in each node agree on \(y\) (the tree fits the data exactly)
- all features are exhausted (nothing left to split on)
additional information
threshold splitting
For a real-valued feature, we perform what’s called a “threshold split”: choose thresholds between adjacent data points as the candidate “split values” to check. Can we split on the same feature twice? Yes: we can keep splitting until we get bored or we overfit.
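A possible sketch of threshold splitting follows. It assumes the candidate thresholds are the midpoints between adjacent distinct sorted values (the notes only say "between two points"), and the function names and data are hypothetical.

```python
from collections import Counter

def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values (an assumed choice)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

def threshold_error(values, labels, t):
    """Classification error of the split x < t versus x >= t."""
    left = [y for x, y in zip(values, labels) if x < t]
    right = [y for x, y in zip(values, labels) if x >= t]
    mistakes = sum(len(side) - Counter(side).most_common(1)[0][1]
                   for side in (left, right) if side)
    return mistakes / len(labels)

# Hypothetical real-valued feature and labels.
incomes = [30, 40, 60, 80]
labels = ["risky", "risky", "safe", "safe"]
thresholds = candidate_thresholds(incomes)
best = min(thresholds, key=lambda t: threshold_error(incomes, labels, t))
print(thresholds, best)
```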
exponential family
Last edited: October 10, 2025
The exponential family is a class of distributions whose probability (density) functions all share one exponential form, given under the requirements below.
constituents
- \(y\): the data
- \(\eta\): the natural parameter (vector or scalar)
- \(T\qty(y)\): the “sufficient statistic”, which is usually just \(y\) (vector or scalar)
- \(b\qty(y)\): the base measure (scalar)
- \(a\qty(\eta)\): the log-partition function (scalar)
requirements
A class of distributions is in the Exponential Family if it can be written as:
\begin{align} P\qty(y \mid \eta) &= b\qty(y) \exp \qty(\eta^{\top}T\qty(y)-a\qty(\eta)) \\ &= \frac{b\qty(y) \exp \qty(\eta^{\top} T\qty(y))}{e^{a\qty(\eta)}} \end{align}
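As a worked example of this form, consider a Bernoulli distribution with mean \(\phi\) (the symbol \(\phi\) is an assumption introduced here, not part of the notes above). Rewriting its mass function:
\begin{align} P\qty(y; \phi) &= \phi^{y}\qty(1-\phi)^{1-y} \\ &= \exp \qty(y \log \frac{\phi}{1-\phi} + \log \qty(1-\phi)) \end{align}
Matching terms against the general form gives \(\eta = \log \frac{\phi}{1-\phi}\), \(T\qty(y) = y\), \(b\qty(y) = 1\), and \(a\qty(\eta) = -\log \qty(1-\phi) = \log \qty(1+e^{\eta})\). Inverting the first identity yields \(\phi = \frac{1}{1+e^{-\eta}}\), so the Bernoulli mean is a sigmoid of its natural parameter.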
logistic regression
Last edited: October 10, 2025
Using Linear Regression to perform a classification task \(y \in \qty {0,1}\) sounds kind of silly, since its predictions are not confined to \([0,1]\). Instead, we assume that \(y\) given \(x\) follows a Bernoulli distribution.
requirements
\begin{align} h_{\theta}\qty(x) &= g\qty(\theta^{T} x) \\ &= \frac{1}{1+e^{-\theta^{T}x}} \end{align}
That is, we apply a sigmoid function \(g\) to the linear regression output \(\theta^{T}x\) to perform classification. We interpret the result as a probability, so the model satisfies:
\begin{equation} p\qty(y=1|x;\theta) = h_{\theta}\qty(x) \end{equation}
\begin{equation} p\qty(y=0|x;\theta) = 1-h_{\theta}\qty(x) \end{equation}
\begin{equation} y \in \qty {0,1} \end{equation}
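The hypothesis and the two class probabilities above can be sketched in a few lines. The parameter values and the convention that the first input entry is a constant 1 (so the first parameter acts as an intercept) are assumptions for illustration.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Return (p(y=0|x), p(y=1|x)) under the logistic model h = g(theta^T x)."""
    h = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return 1.0 - h, h

theta = [0.0, 1.0, -1.0]  # hypothetical parameters
x = [1.0, 2.0, 2.0]       # x[0] = 1 so theta[0] plays the role of an intercept
p0, p1 = predict_proba(theta, x)
print(p0, p1)
```

Here \(\theta^{T}x = 0\), so the model sits exactly on the decision boundary and assigns equal probability to both classes.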
