Newton's Method
Take the second-order Taylor expansion of \(f\) about the current iterate \(x_{t-1}\):
\begin{equation} f(x) \approx f(x_{t-1}) + (x-x_{t-1}) f'(x_{t-1}) + \frac{(x-x_{t-1})^{2}}{2} f''(x_{t-1}) \end{equation}
Taking the derivative of this approximation with respect to \(x\), we obtain:
\begin{equation} f'(x_{t-1}) + (x-x_{t-1}) f''(x_{t-1}) \end{equation}
Setting this derivative to zero and solving for \(x\), we obtain:
\begin{equation} x = x_{t-1} - \frac{f'(x_{t-1})}{f''(x_{t-1})} \end{equation}
This update converges quadratically to a local minimum/maximum. In multiple dimensions, the same update for finding stationary points of an objective \(J\) is:
\begin{equation} x_{t} = x_{t-1} - \mathbf{H}^{-1} \nabla J \end{equation}
where \(\mathbf{H}\) is the Hessian of \(J\) and \(\nabla J\) is its gradient.
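As a sanity check, here is a minimal Python sketch of the one-dimensional update above (the function names and the example objective are illustrative assumptions, not part of the note):

def newton_optimize(fprime, fprime2, x0, tol=1e-10, max_iters=100):
    # repeatedly apply x_t = x_{t-1} - f'(x_{t-1}) / f''(x_{t-1})
    x = x0
    for _ in range(max_iters):
        step = fprime(x) / fprime2(x)
        x = x - step
        if abs(step) < tol:  # converged (quadratically, near the optimum)
            break
    return x

# example: minimize f(x) = (x - 3)^2, with f'(x) = 2(x - 3) and f''(x) = 2
x_star = newton_optimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)  # -> 3.0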
For finding ZEROS of \(f\) instead of minima, apply the analogous update to \(f\) itself:
\begin{equation} x_{t} = x_{t-1} - \frac{f(x_{t-1})}{f'(x_{t-1})} \end{equation}
regularization
Regularization penalizes large weights to reduce overfitting:
- create a data interpolation that contains intentional error (by throwing away/hiding parameters), missing some/all of the data points
- this makes the resulting function more “predictable”/“smooth”
There is, therefore, a trade-off between sacrificing quality on the ORIGINAL data and gaining better accuracy on new points. If you regularize too much, you will underfit.
Motivation
Recall that, for linear regression, we want to optimize:
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} \end{equation}
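One standard way to penalize large weights here is ridge (L2) regularization, shown as an illustrative example (the penalty form and the coefficient \(\lambda\) are assumptions, not stated in the note): add a term that grows with the size of \(\theta\),
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} + \frac{\lambda}{2} \norm{\theta}^{2} \end{equation}
where a larger \(\lambda\) shrinks the weights more aggressively (and, past a point, underfits).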
depth first search
Each vertex keeps track of one of three states: unvisited, in progress, done. We also keep track of when we enter each vertex (its start time) and when we are done with it (its finish time).
UNVISITED, IN_PROGRESS, DONE = 0, 1, 2

# initialize vertex.state = UNVISITED across the graph before the first call
def dfs(vertex, currentTime):
    vertex.startTime = currentTime   # record when we enter this vertex
    currentTime += 1
    vertex.state = IN_PROGRESS
    for v in vertex.neighbors:
        if v.state == UNVISITED:     # to prevent loops
            currentTime = dfs(v, currentTime)
    currentTime += 1
    vertex.finishTime = currentTime  # record when we are done with this vertex
    vertex.state = DONE
    return currentTime
This explores only the connected component containing the starting vertex, so to cover the whole graph you presumably have to run it repeatedly, iterating through all vertices (see the sketch below).
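A minimal sketch of that outer loop, assuming the graph object exposes a vertices collection (an illustrative name, not from the note):

def dfs_all(graph):
    # reset every vertex, then start a DFS from each still-unvisited one,
    # so every connected component gets explored
    for vertex in graph.vertices:
        vertex.state = UNVISITED
    currentTime = 0
    for vertex in graph.vertices:
        if vertex.state == UNVISITED:
            currentTime = dfs(vertex, currentTime)
    return currentTime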
graph
undirected graph
constituents
- vertices \(V\), a set of node values
- edges \(E\), a set of unordered pairs of vertices
requirements
We write:
\begin{equation} G = \qty(V, E) \end{equation}
We say:
- the degree of a vertex is the number of edges coming out of it
- connected vertices are neighbors
additional information
directed graph
constituents
- vertices \(V\), a set of node values
- edges \(E\), a set of ordered pairs of vertices (directed edges)
requirements
We write:
\begin{equation} G = \qty(V, E) \end{equation}
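As a concrete illustration (the dictionary-of-lists representation and the vertex names are assumptions, not part of the note), both kinds of graph can be stored as adjacency lists in Python:

# undirected: the edge {a, b} appears in both adjacency lists
undirected = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}

# directed: the edge (a, b) appears only in a's list
directed = {
    "a": ["b"],
    "b": ["c"],
    "c": [],
}

# degree of "a" in the undirected graph: number of edges coming out of it
degree_a = len(undirected["a"])  # 2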
Markov Chain Monte-Carlo
Start with an initial sample \(\tau\).
At each iteration…
- sample \(\tau' \sim g\qty(\cdot | \tau)\) (for instance, \(\tau' \sim \mathcal{N}\qty(\cdot | \tau, \sigma^{2})\))
- accept the sample with probability given by \(\frac{\bar{p} \qty(\tau') g\qty(\tau | \tau')}{\bar{p}\qty(\tau) g\qty(\tau' | \tau)}\), otherwise keep \(\tau\) (this is also called the Metropolis-Hastings criterion; see the sketch after this list)
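A minimal Python sketch of these two steps, assuming a one-dimensional state, a symmetric Gaussian proposal, and an illustrative unnormalized density p_bar (all names here are assumptions, not from the note):

import numpy as np

def p_bar(tau):
    # example unnormalized target density (a mixture of two Gaussian bumps)
    return np.exp(-0.5 * (tau - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (tau + 2.0) ** 2)

def metropolis_hastings(n_samples, tau0=0.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tau = tau0
    samples = []
    for _ in range(n_samples):
        # propose tau' ~ N(. | tau, sigma^2); the Gaussian proposal is symmetric,
        # so g(tau | tau') / g(tau' | tau) cancels in the acceptance ratio
        tau_prime = rng.normal(tau, sigma)
        accept = min(1.0, p_bar(tau_prime) / p_bar(tau))
        if rng.random() < accept:
            tau = tau_prime      # accept the proposal
        samples.append(tau)      # otherwise keep the current tau
    return np.array(samples)

samples = metropolis_hastings(10_000)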
intuition
The kernel is often chosen to be symmetric, so:
\begin{equation} g\qty(\tau | \tau') = g\qty(\tau' | \tau) \end{equation}
We want to sample from areas of high density more often. Consider:
