Why Do Neural Nets Suddenly Work?
Regularization
We want to be able to manipulate our parameters so that our models learn better. For instance, we can add an L2 penalty that keeps the weights small (see the sketch below):
\begin{equation} J_{L2}(\theta) = J(\theta) + \lambda \sum_{k} \theta^{2}_{k} \end{equation}
or good ol’ dropout, a kind of “feature-dependent regularization”.
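As a concrete illustration of the L2 penalty above, here is a minimal PyTorch sketch (the linear model, data shapes, and λ are made-up assumptions, not anything from the notes):

```python
import torch
import torch.nn as nn

# Hypothetical tiny linear model on random data, just to show the penalty term.
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

lam = 1e-4                                                 # regularization strength (lambda)
data_loss = nn.functional.mse_loss(model(x), y)            # J(theta)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + lam * l2_penalty                        # J_L2(theta)
loss.backward()

# Roughly equivalent shortcut (up to a factor of 2 in how lambda is defined):
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```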
Motivation
- classic view: regularization exists to prevent overfitting when we have many features relative to the amount of training data
- NEW view with big models: regularization is what produces models that generalize well when the parameter count is large enough
Dropout
Dropout: prevents feature co-adaptation (a unit cannot rely on specific other units always being present), which results in good regularization.
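A minimal sketch of dropout in PyTorch; the two-layer net, layer sizes, and p=0.5 are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer net; with p=0.5 each hidden unit is zeroed
# independently with probability 0.5 during training.
net = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

x = torch.randn(32, 10)

net.train()                  # dropout active: random units zeroed, survivors scaled by 1/(1-p)
train_out = net(x)

net.eval()                   # dropout disabled at evaluation time
with torch.no_grad():
    eval_out = net(x)
```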
Language Model
See Language Model