Suppose you have a dataset with two features. You can write a predictor \(h\qty(x)\) as:
\begin{equation} h\qty(x) = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} \end{equation}
This is a smidge unwieldy, because we have to keep tacking on \(\theta\) terms (with the intercept \(\theta_{0}\) as a special case) whenever we add new features. So, a trick: set \(x_0 = 1\). This yields, equivalently…
definition
\begin{equation} h\qty(x) = \sum_{j=0}^{m} \theta_{j} x_{j} = \theta^{T} x \end{equation}
where: \(\theta = \mqty(\theta_{0} \\ \dots \\ \theta_{m})\),
for \(m\) features.
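As a quick sanity check of the \(x_0 = 1\) trick, here is a minimal NumPy sketch; the numbers are made up purely for illustration:

```python
import numpy as np

def predict(theta, x):
    """Evaluate h(x) = theta^T x, assuming x already has the x_0 = 1 entry prepended."""
    return theta @ x

# toy example (made-up numbers): two features plus the constant x_0 = 1
theta = np.array([0.5, 2.0, -1.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1, x_1 = 3, x_2 = 4
print(predict(theta, x))             # 0.5 + 2*3 - 1*4 = 2.5
```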
additional information
finding good \(\theta\)
To find the optimal parameters, we minimize the least-squares error. In particular, for:
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
we compute:
\begin{equation} \theta = \arg\min_{\theta} J\qty(\theta) \end{equation}
To do this, we use gradient descent.
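A minimal sketch of batch gradient descent on this \(J\qty(\theta)\); the learning rate and iteration count are arbitrary illustrative choices, not from the note:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2.

    X is (n, m+1) with the x_0 = 1 column already included; y is (n,).
    alpha (learning rate) and iters are illustrative defaults.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = X @ theta - y   # h_theta(x^(i)) - y^(i) for every i
        grad = X.T @ residuals      # gradient of J(theta)
        theta -= alpha * grad
    return theta
```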
Linear Regression with Nonlinear Bases
See Linear Regression with Nonlinear Bases
Locally-Weighted Regression
Let’s consider a scenario where your data isn’t entirely linear. Naively, you can feature engineer your way until the data becomes linear (e.g. set \(h\qty(x) = \theta_{0} + \theta_{1} x + \theta_{2} x^{2}\), etc.). Feature engineering, however, sucks.
So instead, what if we just find a best-fit linear function near the point of interest at which we want to perform regression? Assuming the function is locally linear, we wait until a query point arrives at test time, then fit a fresh \(\theta\) local to that point by upweighting the training points near it.
constituents
- a dataset, following the notation above
- a test time input point \(x\)
- an effect size parameter \(\tau\)
requirements
Fit \(\theta\) to minimize:
\begin{equation} \frac{1}{2} \sum_{i=1}^{n} w^{(i)} \qty(y^{(i)} - \theta^{T} x^{(i)})^{2} \end{equation}
where \(w^{(i)} = \exp \qty(-\frac{\qty(x^{(i)} - x)^{2}}{2 \tau^{2}})\).
- if \(|x^{(i)} - x|\) is small, \(w^{(i)} \approx 1\) — we care about local points more
- if \(|x^{(i)}-x|\) is large, \(w^{(i)} \approx 0\) — we care about far away points less
\(\tau\) controls the width of the effect size bump; the bigger \(\tau\), the wider the bump (i.e. the more that farther-away points matter).
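To make the test-time procedure concrete, here is a minimal sketch. It minimizes the weighted objective above in closed form via the weighted normal equations, which is one standard way to do it; the note itself doesn’t prescribe a particular solver, and the function names are my own:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Predict at x_query by fitting a fresh theta with Gaussian weights.

    X is (n, m+1) with the x_0 = 1 column included, y is (n,),
    x_query is (m+1,) with its own leading 1.
    """
    # w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)), using squared Euclidean distance
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dists / (2 * tau ** 2))

    # weighted normal equations: (X^T W X) theta = X^T W y
    XtW = X.T * w                       # broadcasts w across the columns of X^T
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return theta @ x_query
```

Note that a new \(\theta\) is solved for on every call, which is exactly why the whole training set has to stay around (see below).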
additional information
non-parametricity
One major drawback of this is that it’s a Non-Parametric Learning Algorithm, meaning you have to keep the whole dataset around.