constituents
- samples \(x,z \in \mathcal{D}\)
- \(d\) input dimensions, \(p\) feature dimensions
- feature map: \(\phi: \mathbb{R}^{d} \to \mathbb{R}^{p}\)
- kernel: \(k\qty(x,z) = \langle\phi\qty(x), \phi\qty(z)\rangle\)
requirements
train
Compute \(k\qty(x^{(i)}, x^{(j)})\) for all \(i,j \in \{1, \ldots, n\}\). Set \(\beta_{j} = 0\) for all \(j\). Iterate:
\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty[y^{(i)} - \sum_{j=1}^{n} \beta_{j} k\qty(x^{(i)}, x^{(j)})] \end{equation}
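A minimal NumPy sketch of this training loop (the function name and hyperparameters are illustrative, not from any particular library):

```python
import numpy as np

def kernel_train(X, y, kernel, alpha=0.01, iters=100):
    """Fit beta via beta_i += alpha * (y_i - sum_j beta_j k(x^(i), x^(j)))."""
    n = X.shape[0]
    # precompute the kernel (Gram) matrix K[i, j] = k(x^(i), x^(j))
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            beta[i] += alpha * (y[i] - K[i] @ beta)
    return beta, K
```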
test
Given \(x\), compute \(\theta^{T} \phi\qty(x)\), which expands as:
\begin{align} \theta^{T}\phi\qty(x) &= \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x) \\ &= \sum_{i=1}^{n} \beta_{i} k\qty(x^{(i)}, x) \end{align}
This takes \(O\qty(nd)\) time, since each of the \(n\) kernel evaluations costs \(O\qty(d)\).
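A matching prediction sketch, assuming the `beta` and `kernel` from the training sketch above:

```python
def kernel_predict(x, X_train, beta, kernel):
    """theta^T phi(x) = sum_i beta_i k(x^(i), x); O(n d) for an O(d) kernel."""
    return sum(b_i * kernel(x_i, x) for b_i, x_i in zip(beta, X_train))
```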
additional information
kernel smoothing
see kernel smoothing
some useful kernels
monomials of order up to \(t\)
we have:
\begin{equation} k\qty(x,z) = \qty(x^{T}z + c)^{t} \end{equation}
which corresponds to a feature map of all monomials \(x_{i}, x_{i}x_{j}, x_{i}x_{j}x_{k}, \ldots\) up to degree \(t\), giving a \(\binom{d+t}{t}\)-dimensional feature space. For \(\qty(x^{T}z+c)^{2}\), for instance, we have:
\begin{equation} \phi\qty(x) = \mqty(c \\ \sqrt{2c}\, x_1 \\ \vdots \\ \sqrt{2c}\, x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_{d} x_{d}) \end{equation}
which you can verify by expanding (FOILing) \(\qty(x^{T}z + c)^{2}\) and matching each term to a product of feature-map coordinates.
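As a sanity check, here is a small NumPy sketch (the helper name `phi_quadratic` is made up for illustration) verifying that the explicit feature map above reproduces \(\qty(x^{T}z + c)^{2}\):

```python
import numpy as np

def phi_quadratic(x, c=1.0):
    """Constant c, scaled linear terms sqrt(2c) x_i, and all pairs x_i x_j."""
    pairs = np.outer(x, x).ravel()  # every product x_i * x_j
    return np.concatenate(([c], np.sqrt(2 * c) * x, pairs))

rng = np.random.default_rng(0)
x, z, c = rng.normal(size=3), rng.normal(size=3), 2.0
assert np.isclose((x @ z + c) ** 2, phi_quadratic(x, c) @ phi_quadratic(z, c))
```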
gaussian kernel
see gaussian kernel, which has:
\begin{equation} k\qty(x,z) = \exp \qty(\frac{- \norm{x-z}^{2}_{2}}{2\sigma^{2}}) \end{equation}
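A one-line sketch in NumPy, with `sigma` as the bandwidth parameter:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```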
what makes a valid kernel?
This is a necessary and sufficient condition:
Suppose we have \(n\) data points, \(x^{(1)} \ldots x^{(n)}\). Consider a structure called the kernel matrix: \(K \in \mathbb{R}^{n \times n}\), such that \(K_{ij} = k\qty(x^{(i)}, x^{(j)})\). A valid kernel requires that \(K \succeq 0\), namely that \(K\) is positive semi-definite.
Consider what that would mean:
\begin{align} z^{T} K z &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} K_{ij} z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x^{(j)}) z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n}z_{i} \sum_{l}^{} \qty(\phi_{l}\qty(x^{(i)}) \phi_{l}\qty(x^{(j)})) z_{j} \\ &= \sum_{l}^{} \qty(\sum_{i}^{} z_{i}\phi_{l}\qty(x^{(i)}))^{2} \geq 0 \end{align}
This gives us that \(K\) being PSD is a necessary condition for \(k\) being a valid kernel. What about sufficiency? We will prove it by divine intervention (this is Mercer's theorem, stated without proof):
\(k\) is a valid kernel IFF for all \(n < \infty\) and all \(x^{(1)} \ldots x^{(n)}\), the corresponding kernel matrix is PSD.
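In practice you can spot-check this condition numerically on a finite sample; a sketch, assuming a symmetric `kernel` callable and a small tolerance for floating-point error:

```python
import numpy as np

def kernel_matrix_is_psd(X, kernel, tol=1e-10):
    """Build K[i, j] = k(x^(i), x^(j)) and check its eigenvalues are >= 0."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    eigvals = np.linalg.eigvalsh(K)  # valid because K is symmetric
    return bool(np.all(eigvals >= -tol))
```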
deriving the kernel trick
Consider now the update rule for \(\beta\) from above:
\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \end{equation}
plugging in the definition of \(\theta\) in terms of \(\beta\) above, we obtain:
\begin{align} \beta_{i} &= \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \qty(\sum_{j=1}^{n} \beta_{j} \phi\qty(x^{(j)}))^{T}\phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \sum_{j=1}^{n} \beta_{j}\left\langle \phi \qty(x^{(j)}), \phi\qty(x^{(i)}) \right\rangle) \end{align}
This gives us two observations:
- \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\), \(\forall i,j\) can be precomputed
- often computing these inner products can be much faster than \(O\qty(p)\)
efficiently computing the kernel \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\)
Consider the following scheme to feature engineer your data:
\begin{equation} \phi \qty(x) = \mqty[1 \\ x_1 \\ \dots \\ x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \dots \\ x_1x_1x_1 \\ x_1x_1x_2 \\ \dots \\ x_{d} x_{d} x_{d}] \end{equation}
How would we compute the inner product above? Naively, for two data samples \(x,z\):
\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}+ \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k} \end{equation}
You will notice:
\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}}_{\langle x,z \rangle^{2}}+ \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k}}_{\langle x,z \rangle^{3}} \end{equation}
So we can finally write:
\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \langle x,z \rangle + \langle x,z \rangle^{2} + \langle x,z \rangle^{3} \end{equation}
which is much cheaper. As opposed to the naive \(O\qty(np)\) per-iteration cost of linear regression in feature space, we have \(O\qty(n^{2})\) per iteration plus \(O\qty(n^{2}d)\) for precomputing the inner products above. This is nice because \(p\) may be much bigger than \(n\).
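A quick numerical check of this identity, comparing the explicit triple sum against the \(O\qty(d)\) kernel-trick expression:

```python
import numpy as np

def naive_monomial_inner_product(x, z):
    """Sum the monomial products term by term, as in the triple sum above."""
    d = len(x)
    lin = sum(x[i] * z[i] for i in range(d))
    quad = sum(x[i] * x[j] * z[i] * z[j] for i in range(d) for j in range(d))
    cub = sum(x[i] * x[j] * x[k] * z[i] * z[j] * z[k]
              for i in range(d) for j in range(d) for k in range(d))
    return 1 + lin + quad + cub

rng = np.random.default_rng(1)
x, z = rng.normal(size=4), rng.normal(size=4)
fast = 1 + x @ z + (x @ z) ** 2 + (x @ z) ** 3  # O(d) via the kernel trick
assert np.isclose(naive_monomial_inner_product(x, z), fast)
```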
motivation
Suppose you have some kind of non-linear (say, exponential) dataset… Consider now Linear Regression with a feature map \(\phi\), whose batch gradient descent update is:
\begin{equation} \theta := \theta + \alpha \sum_{i=1}^{n} \qty(y^{(i)} - \theta^{T}\phi\qty(x^{(i)})) \phi \qty(x^{(i)}) \end{equation}
NOTICE that \(\theta\) is a linear combination of the \(\phi\qty(x^{(i)})\)s, which we can write as:
\begin{equation} \theta = \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)}) \end{equation}
observe: this is \(n\) variables \(\beta_{1} \ldots \beta_{n}\) instead of \(p\) variables. This is nice because if \(p\) were large, the feature map \(\phi\) could get really expensive.
Substituting the above in for \(\theta\), we obtain:
\begin{equation} \theta = \sum_{i=1}^{n} \underbrace{\qty(\beta_{i} + \alpha\qty(y^{(i)} - \theta^{T}\phi\qty(x^{(i)})))}_{\text{new $\beta_i$}} \phi\qty(x^{(i)}) \end{equation}
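A small sketch verifying one batch-gradient step: updating \(\theta\) directly and updating \(\beta\) then reconstructing \(\theta = \sum_i \beta_i \phi\qty(x^{(i)})\) give the same result (the quadratic feature map here is just an example choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha = 5, 3, 0.1
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
phi = lambda x: np.concatenate(([1.0], x, np.outer(x, x).ravel()))  # example map
Phi = np.array([phi(x) for x in X])

theta = np.zeros(Phi.shape[1])                   # theta-space start (beta = 0)
theta_new = theta + alpha * Phi.T @ (y - Phi @ theta)

beta = np.zeros(n)                               # beta-space version of the same step
beta_new = beta + alpha * (y - Phi @ theta)      # the "new beta_i" from the bracket above
assert np.allclose(theta_new, Phi.T @ beta_new)  # theta = sum_i beta_i phi(x^(i))
```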