Neural networks are non-linear learning architectures that combine matrix multiplications with entry-wise non-linear operations.
two layers
constituents
Consider a two-layer neural network with:
- \(m\) hidden units
- \(d\) dimensional input \(x \in \mathbb{R}^{d}\)
requirements
\begin{align} &\forall j \in \qty{1, \dots, m}\\ &z_{j} = \qty(w_{j}^{(1)})^{T} x + b_{j}^{(1)}\\ &a_{j} = \text{ReLU}\qty(z_{j}) \\ &a = \qty(a_1, \dots, a_{m})^{T} \in \mathbb{R}^{m} \\ &h_{\theta} \qty(x) = \qty(w^{(2)})^{T} a + b^{(2)} \end{align}
Here the \(z_{j}\) are the hidden units (pre-activations), the \(a_{j}\) are the activated hidden units, and \(h_{\theta}\) is the prediction function.
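A minimal NumPy sketch of this per-unit computation (the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def two_layer_forward(x, W1, b1, w2, b2):
    """Per-unit forward pass of a two-layer network.

    x  : (d,)   input
    W1 : (m, d) rows are the w_j^{(1)}
    b1 : (m,)   biases b_j^{(1)}
    w2 : (m,)   output weights w^{(2)}
    b2 : scalar output bias b^{(2)}
    """
    m = W1.shape[0]
    a = np.empty(m)
    for j in range(m):               # one hidden unit at a time
        z_j = W1[j] @ x + b1[j]      # z_j = w_j^{(1)T} x + b_j^{(1)}
        a[j] = relu(z_j)             # a_j = ReLU(z_j)
    return w2 @ a + b2               # h_theta(x) = w^{(2)T} a + b^{(2)}
```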
vectorized two-layer
constituents
- \(m\) hidden units per layer
- \(d\) input dimension
requirements
\begin{equation} W^{(1)} = \mqty[\qty(w_1^{(1)})^{T} \\ \vdots \\ \qty(w_m^{(1)})^{T}] \end{equation}
which is an \(m \times d\) matrix. So this gives:
\begin{equation} \mqty[z_1 \\ \vdots \\ z_{m}] = \mqty[\qty(w_1^{(1)})^{T} \\ \vdots \\ \qty(w_m^{(1)})^{T}] \mqty[x_1 \\ \vdots \\ x_{d}] + \mqty[b_1^{(1)} \\ \vdots \\ b_m^{(1)}] \end{equation}
where \(z \in \mathbb{R}^{m \times 1}, W^{(1)} \in \mathbb{R}^{m \times d}, x \in \mathbb{R}^{d \times 1}, b^{(1)} \in \mathbb{R}^{m \times 1}\). Writing this as matrix operations:
\begin{equation} z = W^{(1)} x + b^{(1)} \end{equation}
and
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
with:
\begin{equation} h_{\theta}\qty(x) = \qty(w^{(2)})^{T} a + b^{(2)} \end{equation}
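The same computation, vectorized; a minimal sketch assuming the shapes above (names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def two_layer_forward_vec(x, W1, b1, w2, b2):
    """Vectorized two-layer forward pass.

    x : (d,), W1 : (m, d), b1 : (m,), w2 : (m,), b2 : scalar
    """
    z = W1 @ x + b1        # z = W^{(1)} x + b^{(1)}, shape (m,)
    a = relu(z)            # a = ReLU(z), applied entry-wise
    return w2 @ a + b2     # h_theta(x) = w^{(2)T} a + b^{(2)}
```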
multi-layer
\begin{equation} a^{(1)} = \text{ReLU}\qty(W^{(1)} x + b^{(1)}) \end{equation}
\begin{equation} a^{(2)} = \text{ReLU}\qty(W^{(2)} a^{(1)} + b^{(2)}) \end{equation}
and so on…
\begin{equation} a^{(r-1)} = \text{ReLU}\qty(W^{(r-1)} a^{(r-2)} + b^{(r-1)}) \end{equation}
\begin{equation} h_{\theta}\qty(x) = W^{(r)} a^{(r-1)} + b^{(r)} \end{equation}
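A sketch of the general \(r\)-layer forward pass, assuming ReLU activations everywhere except the last (affine) layer; `Ws` and `bs` are illustrative names:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, Ws, bs):
    """Forward pass of an r-layer fully-connected network.

    Ws = [W^{(1)}, ..., W^{(r)}], bs = [b^{(1)}, ..., b^{(r)}].
    """
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)          # a^{(k)} = ReLU(W^{(k)} a^{(k-1)} + b^{(k)})
    return Ws[-1] @ a + bs[-1]       # h_theta(x) = W^{(r)} a^{(r-1)} + b^{(r)}
```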
metadata
- total number of neurons: \(m_1 + m_2 + \dots + m_{r}\)
- number of parameters: \(\qty(d+1) m_1 + \qty(m_{1}+1)m_{2} + \dots + \qty(m_{r-1}+1)m_{r}\)
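A quick sanity check of the parameter-count formula (the layer sizes below are made up for illustration):

```python
def n_params(d, ms):
    """(d+1) m_1 + (m_1+1) m_2 + ... + (m_{r-1}+1) m_r."""
    sizes = [d] + list(ms)
    return sum((sizes[i] + 1) * sizes[i + 1] for i in range(len(ms)))

# e.g. d = 3 inputs, layer sizes m_1 = 4, m_2 = 2:
# (3+1)*4 + (4+1)*2 = 16 + 10 = 26
assert n_params(3, [4, 2]) == 26
```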
additional information
The training objective of a neural network is non-convex: it admits local optima, and we are not guaranteed to find a global optimum.
neuron
Consider first a single-neuron neural network in one dimension; let's think of a slightly non-linear case:
\begin{align} h_{\theta}\qty(x) &= \max \qty(wx+b, 0) \end{align}
It admits two parameters, \(\theta = \qty(w, b) \in \mathbb{R}^{2}\). Such a function is called the ReLU function. What if we have multiple input features? Consider \(x \in \mathbb{R}^{d}\), \(w \in \mathbb{R}^{d}\), and \(b \in \mathbb{R}\). Now:
\begin{equation} h_{\theta} \qty(x) = \text{ReLU}\qty(w^{\top}x + b) \end{equation}
We refer to the ReLU function as an “activation function”.
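A small sketch of a single ReLU neuron with a vector input (names and numbers are illustrative):

```python
import numpy as np

def single_neuron(x, w, b):
    """h_theta(x) = ReLU(w^T x + b), with x, w in R^d and b a scalar."""
    return max(float(w @ x + b), 0.0)

# e.g. x = [1, 2], w = [0.5, -1], b = 0.25  ->  ReLU(-1.25) = 0
print(single_neuron(np.array([1.0, 2.0]), np.array([0.5, -1.0]), 0.25))
```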
neurons
We can write each latent unit in terms of the input units, weighted by its own parameters; for instance:
\begin{equation} a_1 = \text{ReLU}\qty(\theta_{1} x_1 + \theta_{2} x_2 + \theta_{3}) \end{equation}
Instead of writing this out directly, we can just connect every neuron to every input, resulting in:
\begin{equation} a_1 = \text{ReLU}\qty(w_1^{T} x + b_1) \end{equation}
\begin{equation} a_2 = \text{ReLU}\qty(w_2^{T} x + b_2) \end{equation}
and so on.
why would the neurons learn different things?
Because random initialization breaks the symmetry between units: each unit starts from different weights, gradient descent takes it to a different (local) solution, and so each node specializes.
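A small sketch of this symmetry-breaking intuition: if every hidden unit started from identical weights, they would all compute the same activation (and receive the same gradient), so nothing would specialize; random initialization avoids this. The numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Identical initialization: all three hidden units compute the same activation.
W_same = np.full((3, 5), 0.1)
print(np.maximum(W_same @ x, 0))    # three identical values

# Random initialization breaks the symmetry: each unit computes something different.
W_rand = 0.1 * rng.normal(size=(3, 5))
print(np.maximum(W_rand @ x, 0))    # three (generally) different values
```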
some activation functions
see also Neural Networks
kernel methods vs deep learning
Instead of using the Kernel Trick and a hand-crafted feature map to extract features yourself, deep learning promises to learn the right feature map through multiple non-linear layers.
Let \(\beta\) be the parameters of the hidden layers of a fully-connected neural network; then the final hypothesis function is:
\begin{equation} h_{\theta}\qty(x) = W^{(r)} \phi_{\beta} \qty(x) + b^{(r)} \end{equation}
In some sense, the entire network up to the last layer is a feature map for the final, linear output head. We can therefore think of training a neural network as automatically learning a feature map \(\phi_{\beta}\) and, at the same time, a linear classifier on top of that feature map.
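A sketch of this view (illustrative names): split the network into a feature map \(\phi_{\beta}\), which is every layer but the last, and a linear head on top of it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def phi_beta(x, Ws, bs):
    """Feature map: all layers except the last one."""
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)
    return a                                     # learned features phi_beta(x)

def h_theta(x, Ws, bs):
    """Linear head applied to the learned feature map."""
    return Ws[-1] @ phi_beta(x, Ws, bs) + bs[-1]
```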
