regularization penalizes large weights to reduce overfitting
- creates an interpolation of the data that contains intentional error (by throwing away/shrinking parameters), missing some/all of the data points
- this makes the resulting function more “predictable”/“smooth”
there is, therefore, a trade-off: we sacrifice some fit quality on the ORIGINAL data in exchange for better accuracy on new points. If you regularize too much, you will underfit.
Motivation
Recall that, for linear regression, we want to optimize:
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} \end{equation}
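In matrix form, with design matrix $X$ and target vector $y$ (notation assumed here for compactness), this unregularized objective has the standard normal-equation solution whenever $X^{\top}X$ is invertible:
\begin{equation} \hat{\theta} = (X^{\top}X)^{-1} X^{\top} y \end{equation}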
We can lower model complexity by penalizing large weights (ridge/L2 regularization), which gives us:
\begin{equation} \min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \norm{ y^{(i)} - \theta^{\top}x^{(i)} }^{2} + \lambda \norm{\theta}^{2} \end{equation}
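Setting the gradient of this penalized objective to zero still yields a closed form (a quick sketch; the factor of 2 comes from differentiating $\lambda \norm{\theta}^{2}$ and is often absorbed into $\lambda$). The added diagonal term also makes the matrix invertible even when $X^{\top}X$ is not:
\begin{equation} \hat{\theta}_{\text{ridge}} = (X^{\top}X + 2\lambda I)^{-1} X^{\top} y \end{equation}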
Lasso
The lasso uses an $L_1$ norm penalty on the weights:
\begin{equation} \min_{\theta} \norm{ y - X \theta }_{2}^{2} + \lambda \norm{ \theta }_{1} \end{equation}
which drives the weights of uninformative features to zero (downselecting features that are not useful). Unlike ridge regression, this objective has no closed-form solution and must be solved iteratively (see the sketch below).
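Because there is no closed form, the lasso is typically fit with an iterative method. Below is a minimal proximal-gradient (ISTA) sketch in NumPy; the function names, step-size choice, and iteration count are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink each entry toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    # Minimize ||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient descent.
    n, d = X.shape
    theta = np.zeros(d)
    # Step size = 1 / Lipschitz constant of the gradient of the smooth (squared-error) term.
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ theta - y)          # gradient of the smooth term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

Entries whose gradient step falls below the threshold are set exactly to zero, which is where the sparsity discussed in the next section comes from.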
Regularization Intuition
- L1 encourages sparsity of the weights (setting some to 0)
- L2 encourages smaller values of the weights (weight shrinkage); see the comparison sketch below
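A small illustration of both intuitions, assuming scikit-learn is available; the synthetic dataset and the alpha values are arbitrary choices for this sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 3 of 10 features actually matter.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("L1 (lasso) weights:", np.round(lasso.coef_, 2))  # typically many exact zeros
print("L2 (ridge) weights:", np.round(ridge.coef_, 2))  # shrunk, but nonzero
```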
