Newton's Method
Take the second-order Taylor expansion of \(f\) about the current iterate \(x_{t-1}\):
\begin{equation} f(x) \approx f(x_{t-1}) + (x-x_{t-1}) f'(x_{t-1}) + \frac{(x-x_{t-1})^{2}}{2} f''(x_{t-1}) \end{equation}
Taking the derivative of this approximation with respect to \(x\), we obtain:
\begin{equation} f'(x_{t-1}) + (x-x_{t-1}) f''(x_{t-1}) \end{equation}
Setting this derivative to zero and solving for \(x\), we obtain:
\begin{equation} x = x_{t-1} - \frac{f'(x_{t-1})}{f''(x_{t-1})} \end{equation}
This update converges quadratically to a local minimum/maximum. In multiple dimensions, the same update for finding stationary points of an objective \(J\) is:
\begin{equation} x_{t} = x_{t-1} - \mathbf{H}^{-1} \nabla J \end{equation}
where \(\mathbf{H}\) is the Hessian of \(J\) and \(\nabla J\) is its gradient.
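As a sanity check, here is a minimal Python sketch of the one-dimensional update above (the function names and the example objective are illustrative assumptions, not part of the note):

def newton_optimize(fprime, fprime2, x0, tol=1e-10, max_iters=100):
    # repeatedly apply x_t = x_{t-1} - f'(x_{t-1}) / f''(x_{t-1})
    x = x0
    for _ in range(max_iters):
        step = fprime(x) / fprime2(x)
        x = x - step
        if abs(step) < tol:  # converged (quadratically, near the optimum)
            break
    return x

# example: minimize f(x) = (x - 3)^2, with f'(x) = 2(x - 3) and f''(x) = 2
x_star = newton_optimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)  # -> 3.0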
For finding ZEROS of \(f\) instead of minima, apply the analogous update to \(f\) itself:
\begin{equation} x_{t} = x_{t-1} - \frac{f(x_{t-1})}{f'(x_{t-1})} \end{equation}
regularization
Regularization penalizes large weights to reduce overfitting:
- create a data interpolation that contains intentional error (by throwing away/hiding parameters), missing some/all of the data points
- this makes the resulting function more “predictable”/“smooth”
There is, therefore, a trade-off between sacrificing quality on the ORIGINAL data and gaining better accuracy on new points. If you regularize too much, you will underfit.
Motivation
Recall that, for linear regression, we want to optimize:
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} \end{equation}
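One standard way to penalize large weights here is ridge (L2) regularization, shown as an illustrative example (the penalty form and the coefficient \(\lambda\) are assumptions, not stated in the note): add a term that grows with the size of \(\theta\),
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} + \frac{\lambda}{2} \norm{\theta}^{2} \end{equation}
where a larger \(\lambda\) shrinks the weights more aggressively (and, past a point, underfits).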
depth first search
Each vertex keeps track of one of three states: unvisited, in progress, done. We also keep track of when we enter each vertex (its start time) and when we are done with it (its finish time).
UNVISITED, IN_PROGRESS, DONE = 0, 1, 2

# initialize vertex.state = UNVISITED across the graph before the first call
def dfs(vertex, currentTime):
    vertex.startTime = currentTime   # record when we enter this vertex
    currentTime += 1
    vertex.state = IN_PROGRESS
    for v in vertex.neighbors:
        if v.state == UNVISITED:     # to prevent loops
            currentTime = dfs(v, currentTime)
    currentTime += 1
    vertex.finishTime = currentTime  # record when we are done with this vertex
    vertex.state = DONE
    return currentTime
This explores only the connected component containing the starting vertex, so to cover the whole graph you presumably have to run it repeatedly, iterating through all vertices (see the sketch below).
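A minimal sketch of that outer loop, assuming the graph object exposes a vertices collection (an illustrative name, not from the note):

def dfs_all(graph):
    # reset every vertex, then start a DFS from each still-unvisited one,
    # so every connected component gets explored
    for vertex in graph.vertices:
        vertex.state = UNVISITED
    currentTime = 0
    for vertex in graph.vertices:
        if vertex.state == UNVISITED:
            currentTime = dfs(vertex, currentTime)
    return currentTime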
graph
undirected graph
constituents
- vertices \(V\), a set of node values
- edges \(E\), a set of unordered pairs of vertices
requirements
We write:
\begin{equation} G = \qty(V, E) \end{equation}
We say:
- the degree of a vertex is the number of edges coming out of it
- connected vertices are neighbors
additional information
directed graph
constituents
- vertices \(V\), a set of node values
- edges \(E\), a set of ordered pairs of vertices (directed edges)
requirements
We write:
\begin{equation} G = \qty(V, E) \end{equation}
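As a concrete illustration (the dictionary-of-lists representation and the vertex names are assumptions, not part of the note), both kinds of graph can be stored as adjacency lists in Python:

# undirected: the edge {a, b} appears in both adjacency lists
undirected = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}

# directed: the edge (a, b) appears only in a's list
directed = {
    "a": ["b"],
    "b": ["c"],
    "c": [],
}

# degree of "a" in the undirected graph: number of edges coming out of it
degree_a = len(undirected["a"])  # 2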
Markov Chain Monte-Carlo
Start with an initial sample \(\tau\).
At each iteration…
- sample \(\tau' \sim g\qty(\cdot | \tau)\) (for instance, \(\tau' \sim \mathcal{N}\qty(\cdot | \tau, \sigma^{2})\))
- accept the sample with probability given by \(\frac{\bar{p} \qty(\tau') g\qty(\tau | \tau')}{\bar{p}\qty(\tau) g\qty(\tau' | \tau)}\), otherwise keep \(\tau\) (this is also called the Metropolis-Hastings criterion; see the sketch after this list)
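A minimal Python sketch of these two steps, assuming a one-dimensional state, a symmetric Gaussian proposal, and an illustrative unnormalized density p_bar (all names here are assumptions, not from the note):

import numpy as np

def p_bar(tau):
    # example unnormalized target density (a mixture of two Gaussian bumps)
    return np.exp(-0.5 * (tau - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (tau + 2.0) ** 2)

def metropolis_hastings(n_samples, tau0=0.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tau = tau0
    samples = []
    for _ in range(n_samples):
        # propose tau' ~ N(. | tau, sigma^2); the Gaussian proposal is symmetric,
        # so g(tau | tau') / g(tau' | tau) cancels in the acceptance ratio
        tau_prime = rng.normal(tau, sigma)
        accept = min(1.0, p_bar(tau_prime) / p_bar(tau))
        if rng.random() < accept:
            tau = tau_prime      # accept the proposal
        samples.append(tau)      # otherwise keep the current tau
    return np.array(samples)

samples = metropolis_hastings(10_000)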
intuition
The kernel is often chosen to be symmetric, so:
\begin{equation} g\qty(\tau | \tau') = g\qty(\tau' | \tau) \end{equation}
We want to sample from areas of high density more often. Consider:
