Suppose you have a dataset with two features. You can write a predictor \(h\qty(x)\) as:
\begin{equation} h\qty(x) = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} \end{equation}
This is a smidge unwieldy, because we have to keep tacking on \(\theta\) terms (with the intercept \(\theta_{0}\) as a special case) whenever we add new features. So, a trick: set \(x_0 = 1\). This yields, equivalently…
definition
\begin{equation} h\qty(x) = \sum_{j=0}^{m} \theta_{j} x_{j} = \theta^{T} x \end{equation}
where: \(\theta = \mqty(\theta_{0} \\ \dots \\ \theta_{m})\),
for \(m\) features.
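As a quick sanity check of the \(x_0 = 1\) trick, here is a minimal NumPy sketch; the numbers are made up purely for illustration:

```python
import numpy as np

def predict(theta, x):
    """Evaluate h(x) = theta^T x, assuming x already has the x_0 = 1 entry prepended."""
    return theta @ x

# toy example (made-up numbers): two features plus the constant x_0 = 1
theta = np.array([0.5, 2.0, -1.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1, x_1 = 3, x_2 = 4
print(predict(theta, x))             # 0.5 + 2*3 - 1*4 = 2.5
```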
additional information
finding good \(\theta\)
To find the optimal parameters, we minimize the least-squares error. In particular, for:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
we compute:
\begin{equation} \theta = \arg\min_{\theta} J\qty(\theta) \end{equation}
To do this, we use gradient descent.
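A minimal sketch of batch gradient descent on this \(J\qty(\theta)\); the learning rate and iteration count are arbitrary illustrative choices, not from the note:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2.

    X is (n, m+1) with the x_0 = 1 column already included; y is (n,).
    alpha (learning rate) and iters are illustrative defaults.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = X @ theta - y   # h_theta(x^(i)) - y^(i) for every i
        grad = X.T @ residuals      # gradient of J(theta)
        theta -= alpha * grad
    return theta
```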
Linear Regression with Nonlinear Bases
See Linear Regression with Nonlinear Bases
Locally-Weighted Regression
Let’s consider a scenario where your data isn’t entirely linear. Naively, you can feature engineer your way until the data becomes linear (e.g. set \(h\qty(x) = \theta_{0} + \theta_{1} x + \theta_{2} x^{2}\), etc.). Feature engineering, however, sucks.
So instead, what if we just find a best-fit linear function near the point of interest at which we want to perform regression? Assuming the function is locally linear, we wait until a query point arrives at test time, then fit a fresh \(\theta\) local to that point by upweighting the training points near it.
constituents
- a dataset, following the notation above
- a test time input point \(x\)
- an effect size parameter \(\tau\)
requirements
Fit \(\theta\) to minimize:
\begin{equation} \frac{1}{2} \sum_{i=1}^{n} w^{(i)} \qty(y^{(i)} - \theta^{T} x^{(i)})^{2} \end{equation}
where \(w^{(i)} = \exp \qty(-\frac{\qty(x^{(i)} - x)^{2}}{2 \tau^{2}})\).
- if \(|x^{(i)} - x|\) is small, \(w^{(i)} \approx 1\) — we care about local points more
- if \(|x^{(i)}-x|\) is large, \(w^{(i)} \approx 0\) — we care about far away points less
\(\tau\) controls the width of the effect size bump; the bigger \(\tau\), the wider the bump (i.e. the more that farther-away points matter).
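To make the test-time procedure concrete, here is a minimal sketch. It minimizes the weighted objective above in closed form via the weighted normal equations, which is one standard way to do it; the note itself doesn’t prescribe a particular solver, and the function names are my own:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Predict at x_query by fitting a fresh theta with Gaussian weights.

    X is (n, m+1) with the x_0 = 1 column included, y is (n,),
    x_query is (m+1,) with its own leading 1.
    """
    # w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)), using squared Euclidean distance
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dists / (2 * tau ** 2))

    # weighted normal equations: (X^T W X) theta = X^T W y
    XtW = X.T * w                       # broadcasts w across the columns of X^T
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return theta @ x_query
```

Note that a new \(\theta\) is solved for on every call, which is exactly why the whole training set has to stay around (see below).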
additional information
non-parametricity
One major drawback of this is that it’s a Non-Parametric Learning Algorithm, meaning you have to keep the whole dataset around.