Neural networks are non-linear learning architectures that combine matrix multiplications with entry-wise non-linear operations.
two layers
constituents
Consider a two-layer neural network with:
- \(m\) hidden units
- \(d\) dimensional input \(x \in \mathbb{R}^{d}\)
requirements
\begin{align} &\forall j \in \qty{1, \dots, m}\\ &z_{j} = \qty(w_{j}^{(1)})^{T} x + b_{j}^{(1)}\\ &a_{j} = \text{ReLU}\qty(z_{j}) \\ &a = \qty(a_1, \dots, a_{m})^{T} \in \mathbb{R}^{m} \\ &h_{\theta} \qty(x) = \qty(w^{(2)})^{T} a + b^{(2)} \end{align}
Here the \(z_{j}\) are the hidden units (pre-activations), the \(a_{j}\) are the activated hidden units, and \(h_{\theta}\) is the prediction function.
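A minimal NumPy sketch of this per-unit computation (the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def two_layer_forward(x, W1, b1, w2, b2):
    """Per-unit forward pass of a two-layer network.

    x  : (d,)   input
    W1 : (m, d) rows are the w_j^{(1)}
    b1 : (m,)   biases b_j^{(1)}
    w2 : (m,)   output weights w^{(2)}
    b2 : scalar output bias b^{(2)}
    """
    m = W1.shape[0]
    a = np.empty(m)
    for j in range(m):               # one hidden unit at a time
        z_j = W1[j] @ x + b1[j]      # z_j = w_j^{(1)T} x + b_j^{(1)}
        a[j] = relu(z_j)             # a_j = ReLU(z_j)
    return w2 @ a + b2               # h_theta(x) = w^{(2)T} a + b^{(2)}
```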
vectorized two-layer
constituents
- \(m\) hidden units per layer
- \(d\) input dimension
requirements
\begin{equation} W^{(1)} = \mqty[\qty(w_1^{(1)})^{T} \\ \vdots \\ \qty(w_m^{(1)})^{T}] \end{equation}
which is an \(m \times d\) matrix. So this gives:
\begin{equation} \mqty[z_1 \\ \vdots \\ z_{m}] = \mqty[\qty(w_1^{(1)})^{T} \\ \vdots \\ \qty(w_m^{(1)})^{T}] \mqty[x_1 \\ \vdots \\ x_{d}] + \mqty[b_1^{(1)} \\ \vdots \\ b_m^{(1)}] \end{equation}
where \(z \in \mathbb{R}^{m \times 1}, W^{(1)} \in \mathbb{R}^{m \times d}, x \in \mathbb{R}^{d \times 1}, b^{(1)} \in \mathbb{R}^{m \times 1}\). Writing this as matrix operations:
\begin{equation} z = W^{(1)} x + b^{(1)} \end{equation}
and
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
with:
\begin{equation} h_{\theta}\qty(x) = \qty(w^{(2)})^{T} a + b^{(2)} \end{equation}
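The same computation, vectorized; a minimal sketch assuming the shapes above (names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def two_layer_forward_vec(x, W1, b1, w2, b2):
    """Vectorized two-layer forward pass.

    x : (d,), W1 : (m, d), b1 : (m,), w2 : (m,), b2 : scalar
    """
    z = W1 @ x + b1        # z = W^{(1)} x + b^{(1)}, shape (m,)
    a = relu(z)            # a = ReLU(z), applied entry-wise
    return w2 @ a + b2     # h_theta(x) = w^{(2)T} a + b^{(2)}
```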
multi-layer
\begin{equation} a^{(1)} = \text{ReLU}\qty(W^{(1)} x + b^{(1)}) \end{equation}
\begin{equation} a^{(2)} = \text{ReLU}\qty(W^{(2)} a^{(1)} + b^{(2)}) \end{equation}
and so on…
\begin{equation} a^{(r-1)} = \text{ReLU}\qty(W^{(r-1)} a^{(r-2)} + b^{(r-1)}) \end{equation}
\begin{equation} h_{\theta}\qty(x) = W^{(r)} a^{(r-1)} + b^{(r)} \end{equation}
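A sketch of the general \(r\)-layer forward pass, assuming ReLU activations everywhere except the last (affine) layer; `Ws` and `bs` are illustrative names:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, Ws, bs):
    """Forward pass of an r-layer fully-connected network.

    Ws = [W^{(1)}, ..., W^{(r)}], bs = [b^{(1)}, ..., b^{(r)}].
    """
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)          # a^{(k)} = ReLU(W^{(k)} a^{(k-1)} + b^{(k)})
    return Ws[-1] @ a + bs[-1]       # h_theta(x) = W^{(r)} a^{(r-1)} + b^{(r)}
```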
metadata
- total number of neurons: \(m_1 + m_2 + \dots + m_{r}\)
- number of parameters: \(\qty(d+1) m_1 + \qty(m_{1}+1)m_{2} + \dots + \qty(m_{r-1}+1)m_{r}\)
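A quick sanity check of the parameter-count formula (the layer sizes below are made up for illustration):

```python
def n_params(d, ms):
    """(d+1) m_1 + (m_1+1) m_2 + ... + (m_{r-1}+1) m_r."""
    sizes = [d] + list(ms)
    return sum((sizes[i] + 1) * sizes[i + 1] for i in range(len(ms)))

# e.g. d = 3 inputs, layer sizes m_1 = 4, m_2 = 2:
# (3+1)*4 + (4+1)*2 = 16 + 10 = 26
assert n_params(3, [4, 2]) == 26
```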
additional information
The training objective of a neural network is non-convex: it admits local optima, and we are not guaranteed to find a global optimum.
neuron
Consider first a single-neuron neural network in one dimension; let's think of a slightly non-linear case:
\begin{align} h_{\theta}\qty(x) &= \max \qty(wx+b, 0) \end{align}
It admits two parameters, \(\theta = \qty(w, b) \in \mathbb{R}^{2}\). Such a function is called the ReLU function. What if we have multiple input features? Consider \(x \in \mathbb{R}^{d}\), \(w \in \mathbb{R}^{d}\), and \(b \in \mathbb{R}\). Now:
\begin{equation} h_{\theta} \qty(x) = \text{ReLU}\qty(w^{\top}x + b) \end{equation}
We refer to the ReLU function as an “activation function”.
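A small sketch of a single ReLU neuron with a vector input (names and numbers are illustrative):

```python
import numpy as np

def single_neuron(x, w, b):
    """h_theta(x) = ReLU(w^T x + b), with x, w in R^d and b a scalar."""
    return max(float(w @ x + b), 0.0)

# e.g. x = [1, 2], w = [0.5, -1], b = 0.25  ->  ReLU(-1.25) = 0
print(single_neuron(np.array([1.0, 2.0]), np.array([0.5, -1.0]), 0.25))
```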
neurons
We can write each latent unit in terms of the input units, weighted by its own parameters; for instance:
\begin{equation} a_1 = \text{ReLU}\qty(\theta_{1} x_1 + \theta_{2} x_2 + \theta_{3}) \end{equation}
Instead of writing this out directly, we can just connect every neuron to every input, resulting in:
\begin{equation} a_1 = \text{ReLU}\qty(w_1^{T} x + b_1) \end{equation}
\begin{equation} a_2 = \text{ReLU}\qty(w_2^{T} x + b_2) \end{equation}
and so on.
why would the neurons learn different things?
Because random initialization breaks the symmetry between units: each unit starts from different weights, gradient descent takes it to a different (local) solution, and so each node specializes.
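A small sketch of this symmetry-breaking intuition: if every hidden unit started from identical weights, they would all compute the same activation (and receive the same gradient), so nothing would specialize; random initialization avoids this. The numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Identical initialization: all three hidden units compute the same activation.
W_same = np.full((3, 5), 0.1)
print(np.maximum(W_same @ x, 0))    # three identical values

# Random initialization breaks the symmetry: each unit computes something different.
W_rand = 0.1 * rng.normal(size=(3, 5))
print(np.maximum(W_rand @ x, 0))    # three (generally) different values
```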
some activation functions
see also Neural Networks
kernel methods vs deep learning
Instead of using the Kernel Trick and a hand-crafted feature map to extract features yourself, deep learning promises to learn the right feature map through multiple non-linear layers.
Let \(\beta\) be the parameters of the hidden layers of a fully-connected neural network; then the final hypothesis function is:
\begin{equation} h_{\theta}\qty(x) = W^{(r)} \phi_{\beta} \qty(x) + b^{(r)} \end{equation}
In some sense, the entire network up to the last layer is a feature map for the final, linear output head. We can therefore think of training a neural network as automatically learning a feature map \(\phi_{\beta}\) and, at the same time, a linear classifier on top of that feature map.
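A sketch of this view (illustrative names): split the network into a feature map \(\phi_{\beta}\), which is every layer but the last, and a linear head on top of it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def phi_beta(x, Ws, bs):
    """Feature map: all layers except the last one."""
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)
    return a                                     # learned features phi_beta(x)

def h_theta(x, Ws, bs):
    """Linear head applied to the learned feature map."""
    return Ws[-1] @ phi_beta(x, Ws, bs) + bs[-1]
```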
