deep learning

supervised learning with non-linear models.

Motivation

Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the hypothesis was always linear in \(\theta\)). Today: with deep learning, we can have non-linearity in both \(\theta\) and \(x\).

constituents

  • We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
  • Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
  • Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
  • Optimization: \(\min_{\theta} J\qty(\theta)\)
  • Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
  • Hyperparameters:
    • Learning rate: \(\alpha\)
    • Batch size \(B\)
    • Iterations: \(n_{\text{iter}}\)
  • stochastic gradient descent (where we randomly sample a single dataset point for each update) or batch gradient descent (where we scale the learning rate by the batch size and compute the update over a batch); see the sketch after this list
  • neural network
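
As a concrete illustration of the loop above, here is a minimal NumPy sketch of the update \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\), assuming a linear hypothesis \(h_{\theta}\qty(x) = \theta^{\top} x\) with the squared loss purely for simplicity; the function names and default values are illustrative, not from the source.

```python
import numpy as np

def h(theta, X):
    # linear hypothesis, used here only to keep the sketch simple;
    # a neural network would replace this function
    return X @ theta

def grad_J(theta, X, y):
    # gradient of the average squared loss J(theta) = (1/n) sum_i (y_i - h_theta(x_i))^2
    n = X.shape[0]
    residual = y - h(theta, X)
    return (-2.0 / n) * (X.T @ residual)

def gradient_descent(X, y, alpha=0.01, B=32, n_iter=1000, seed=0):
    # alpha = learning rate, B = batch size, n_iter = number of iterations
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # sample a batch of B points; B = 1 recovers stochastic gradient
        # descent, B = n recovers full-batch gradient descent
        idx = rng.choice(X.shape[0], size=B, replace=False)
        theta = theta - alpha * grad_J(theta, X[idx], y[idx])
    return theta
```

With \(B = 1\) this is stochastic gradient descent; with \(B = n\) it is full-batch gradient descent.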

requirements

additional information

Background

Notation:

\(x\) is the input, \(h\) is the hidden layers, and \(\hat{y}\) is the prediction.

We write the weight from input \(x_{i}\) to hidden unit \(h_{j}\) as \(\theta_{i,j}^{(h)}\). At every neuron on each layer, we calculate:

\begin{equation} h_{j} = \sigma\qty[\sum_{i} x_{i} \theta_{i,j}^{(h)}] \end{equation}

\begin{equation} \hat{y} = \sigma\qty[\sum_{i} h_{i}\theta_{i}^{(y)}] \end{equation}
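
A minimal sketch of this forward pass in NumPy, assuming a single hidden layer with the sigmoid activation \(\sigma\) and ignoring bias terms; the weight shapes and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta_h, theta_y):
    # x       : input vector, shape (d,)
    # theta_h : hidden-layer weights theta^{(h)}_{i,j}, shape (d, m)
    # theta_y : output weights theta^{(y)}_{i}, shape (m,)
    h = sigmoid(x @ theta_h)       # h_j = sigma(sum_i x_i theta^{(h)}_{i,j})
    y_hat = sigmoid(h @ theta_y)   # y_hat = sigma(sum_i h_i theta^{(y)}_{i})
    return y_hat
```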

note! we often