requirements
- \(h_{\theta}\qty(x)\), the predictor function, parameterized by \(\theta\)
- \(\qty(x^{(i)}, y^{(i)})\), the samples of data
definition
\begin{equation} J\qty(\theta) = \frac{1}{2} \sum_{i=1}^{n}\qty(h_{\theta }\qty(x^{(i)}) - y^{(i)})^{2} \end{equation}
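For concreteness, a tiny worked example with made-up numbers: suppose a single sample \(x^{(1)} = 1\), \(y^{(1)} = 2\), a linear predictor \(h_{\theta}\qty(x) = \theta x\), and \(\theta = 1.5\). Then:
\begin{equation} J\qty(\theta) = \frac{1}{2} \qty(1.5 \cdot 1 - 2)^{2} = \frac{1}{2} \qty(-0.5)^{2} = 0.125 \end{equation}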
see also example: gradient descent for least-squares error.
additional information
“why the 1/2”?
Because when you take \(\nabla J\qty(\theta)\), the \(2\) brought down by the power rule cancels the \(\frac{1}{2}\), leaving a cleaner gradient.
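As a sketch, using the linear predictor \(h_{\theta}\qty(x) = \theta^{\top} x\) from below (so \(\frac{\partial}{\partial \theta_{j}} h_{\theta}\qty(x^{(i)}) = x^{(i)}_{j}\)), the partial derivatives work out to:
\begin{align} \frac{\partial J\qty(\theta)}{\partial \theta_{j}} &= \frac{1}{2} \sum_{i=1}^{n} 2 \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_{j}} h_{\theta}\qty(x^{(i)}) \\ &= \sum_{i=1}^{n} \qty(h_{\theta}\qty(x^{(i)}) - y^{(i)}) x^{(i)}_{j} \end{align}
so the \(\frac{1}{2}\) disappears entirely from the gradient.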
probabilistic intuition for least-squares error in linear regression
Assume that our dataset \(\qty(x^{(i)}, y^{(i)}) \sim D\) has the following property: “the true \(y\) value is just our model’s output, plus some error.” Meaning:
\begin{equation} y^{(i)} = \theta^{\top} x^{(i)} + \varepsilon^{(i)} \end{equation}
Assume now also that \(\varepsilon^{(i)} \sim \mathcal{N}\qty(0, \sigma^{2})\) for all \(i\), i.e. that the error is normally distributed with zero mean. Recall the PDF of the normal distribution:
\begin{equation} P\qty(\varepsilon^{(i)}) = \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(\varepsilon^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
Plugging in our definition for \(\varepsilon\) here:
\begin{equation} P\qty(y^{(i)} | x^{(i)}, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
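Equivalently, conditioned on \(x^{(i)}\) and \(\theta\), the label is itself Gaussian, centered on the model’s prediction:
\begin{equation} y^{(i)} \mid x^{(i)}, \theta \sim \mathcal{N}\qty(\theta^{\top} x^{(i)}, \sigma^{2}) \end{equation}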
If we now assume the entire dataset is IID, we can then write:
\begin{align} P\qty(y | x, \theta) &= \prod_{i=1}^{n} P\qty(y^{(i)} | x^{(i)}, \theta) \\ &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{align}
To pick \(\theta\), we perform MLE: we want the parameters that maximize the likelihood of seeing our actual data \(y\). Meaning, we desire:
\begin{equation} \theta = \arg\max_{\theta} P\qty(y | x,\theta) \end{equation}
Let’s do it! First, let’s write the thing we want to maximize, the likelihood of the whole dataset, as a function of \(\theta\):
\begin{equation} L\qty(\theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \qty( \frac{- \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}) \end{equation}
Recall that \(\log\) is monotonic, so
\begin{align} \arg\max_{\theta} L\qty(\theta) &= \arg\max_{\theta} \log \qty(L\qty(\theta)) \\ &= \arg\max_{\theta} \log \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}} \exp \qty(\dots) \\ &= \arg\max_{\theta} n \log \frac{1}{\sigma\sqrt{2\pi}} + \sum_{i=1}^{n} \frac{-\qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}} \end{align}
We can throw away the left term since it’s just a constant that doesn’t depend on \(\theta\). Maximizing the remaining term is the same as minimizing the sum of squared errors; the positive factor \(\frac{1}{2\sigma^{2}}\) doesn’t change the arg max, so the value of \(\sigma\) doesn’t matter. In other words, MLE under Gaussian noise recovers exactly the least-squares error objective \(J\qty(\theta)\). Yay!
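Spelling out that last step as a chain of identities:
\begin{align} \arg\max_{\theta} \log L\qty(\theta) &= \arg\max_{\theta} \sum_{i=1}^{n} \frac{-\qty(y^{(i)}- \theta^{\top}x^{(i)})^{2}}{2\sigma^{2}} \\ &= \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \qty(y^{(i)}- \theta^{\top}x^{(i)})^{2} \\ &= \arg\min_{\theta} J\qty(\theta) \end{align}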
