supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). Today: with deep learning, we can have non-linearity in both \(\theta\) and \(x\).
constituents
- We have the dataset \(\qty{\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\)
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
  - Learning rate: \(\alpha\)
  - Batch size: \(B\)
  - Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a single dataset point per update) or batch gradient descent (where we sample a batch of size \(B\), compute the gradient over the batch, and scale the learning rate by the batch size); see the sketch after this list
- neural network
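To make the optimization loop concrete, below is a minimal sketch of one mini-batch gradient descent run on the squared-loss cost defined above. The data, the linear placeholder hypothesis `h`, and the specific hyperparameter values are all assumptions for illustration, not part of these notes.

```python
import numpy as np

# hypothetical data: n examples x^{(i)} with d features, targets y^{(i)}
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

alpha, B, n_iter = 0.1, 16, 200          # learning rate, batch size, iterations

def h(theta, X):
    """Placeholder hypothesis: a linear model stands in for h_theta here."""
    return X @ theta

theta = np.zeros(d)
for _ in range(n_iter):
    idx = rng.choice(n, size=B, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    resid = h(theta, Xb) - yb                    # h_theta(x^{(i)}) - y^{(i)}
    grad = (2.0 / B) * Xb.T @ resid              # gradient of (1/B) * sum (y - h)^2
    theta = theta - alpha * grad                 # theta <- theta - alpha * grad J(theta)
```

Setting \(B = 1\) recovers stochastic gradient descent, while \(B = n\) recovers full-batch gradient descent.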
requirements
additional information
Background
Notation:
\(x\) is the input, \(h\) is the hidden layer activations, and \(\hat{y}\) is the prediction.
We write the weight from input \(x_{i}\) to hidden unit \(h_{j}\) as \(\theta_{i,j}^{(h)}\). At every neuron on each layer, we calculate:
\begin{equation} h_{j} = \sigma\qty[\sum_{i}^{} x_{i} \theta_{i,j}^{(h)}] \end{equation}
\begin{equation} \hat{y} = \sigma\qty[\sum_{i}^{} h_{i}\theta_{i}^{(y)}] \end{equation}
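As a concrete instance of the two equations above, here is a minimal sketch of a single-hidden-layer forward pass in numpy. The layer sizes, the random weights, and the omission of bias terms are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical sizes: 3 inputs, 4 hidden units
rng = np.random.default_rng(0)
Theta_h = rng.normal(size=(3, 4))    # theta_{i,j}^{(h)}: weight from x_i to h_j
theta_y = rng.normal(size=4)         # theta_i^{(y)}: weight from h_i to y_hat

x = rng.normal(size=3)               # one input example

h = sigmoid(x @ Theta_h)             # h_j = sigma(sum_i x_i * theta_{i,j}^{(h)})
y_hat = sigmoid(h @ theta_y)         # y_hat = sigma(sum_i h_i * theta_i^{(y)})
```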
note! we often
