Kernel Trick

constituents

  • samples \(x,z \in \mathcal{D}\)
  • \(d\) input dimensions, \(p\) featured dimensions
  • feature map: \(\phi: \mathbb{R}^{d} \to \mathbb{R}^{p}\)
  • kernel: \(k\qty(x,z) = \langle\phi\qty(x), \phi\qty(z)\rangle\)

requirements

train

Compute \(k\qty(x^{(i)}, x^{(j)})\) for all \(i, j \in \{1, \dots, n\}\). Initialize \(\beta_{j} = 0\) for all \(j\). Then iterate:

\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty[y^{(i)} - \sum_{j=1}^{n} \beta_{j} k\qty(x^{(i)}, x^{(j)})] \end{equation}
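
A minimal sketch of this training loop in Python (NumPy assumed; the function and variable names here are illustrative, not from the source):

```python
import numpy as np

def train_kernelized(X, y, kernel, lr=0.01, iters=1000):
    """Kernelized update: beta_i += lr * (y_i - sum_j beta_j * K[i, j])."""
    n = X.shape[0]
    # Precompute the kernel (Gram) matrix once: K[i, j] = k(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            beta[i] += lr * (y[i] - K[i] @ beta)
    return beta
```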

test

Given \(x\), compute the prediction \(\theta^{T} \phi\qty(x)\), which expands as:

\begin{align} \theta^{T}\phi\qty(x) &= \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x) \\ &= \sum_{i=1}^{n} \beta_{i} k\qty(x^{(i)}, x) \end{align}

This takes \(O\qty(nd)\) time per test point, assuming each kernel evaluation costs \(O\qty(d)\).
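
A corresponding prediction sketch, under the same illustrative names as the training sketch above:

```python
def predict(x, X_train, beta, kernel):
    """theta^T phi(x) = sum_i beta_i * k(x_i, x), without ever forming phi."""
    return sum(b * kernel(xi, x) for b, xi in zip(beta, X_train))
```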

additional information

kernel smoothing

see kernel smoothing

some useful kernels

monomials of order up to \(t\)

we have:

\begin{equation} k\qty(x,z) = \qty(x^{T}z + c)^{t} \end{equation}

which corresponds to a feature map containing all monomials \(x_{i},\ x_{i}x_{j},\ x_{i}x_{j}x_{k}, \ldots\) up to degree \(t\), i.e. a \(\binom{d+t}{t}\)-dimensional feature space. For \(\qty(x^{T}z+c)^{2}\), for instance, we have:

\begin{equation} \phi\qty(x) = \mqty(c \\ \sqrt{2c}\, x_1 \\ \vdots \\ \sqrt{2c}\, x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_{d} x_{d}) \end{equation}

which you can work out by expanding (FOILing) the kernel and matching each resulting term to a product of feature-map entries.
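
A quick numerical sanity check of this identity for \(t = 2\) (NumPy assumed; the names and sample values are illustrative):

```python
import numpy as np

def phi_quadratic(x, c):
    """Explicit feature map for (x^T z + c)^2: [c, sqrt(2c) x_i, ..., x_i x_j, ...]."""
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(d)]
    return np.array([c] + list(np.sqrt(2 * c) * x) + cross)

x, z, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 0.5
lhs = phi_quadratic(x, c) @ phi_quadratic(z, c)
rhs = (x @ z + c) ** 2
assert np.isclose(lhs, rhs)  # both sides equal (x^T z + c)^2
```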

gaussian kernel

see gaussian kernel, which has:

\begin{equation} k\qty(x,z) = \exp \qty(\frac{- \norm{x-z}^{2}_{2}}{2\sigma^{2}}) \end{equation}
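
A minimal sketch of this kernel (NumPy assumed; \(\sigma\) is the bandwidth):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); corresponds to an
    infinite-dimensional feature map, yet costs only O(d) to evaluate."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```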

what makes a valid kernel?

This is a necessary and sufficient condition:

Suppose we have \(n\) data points \(x^{(1)}, \dots, x^{(n)}\). Consider the kernel matrix \(K \in \mathbb{R}^{n \times n}\), with \(K_{ij} = k\qty(x^{(i)}, x^{(j)})\). A valid kernel requires \(K \succeq 0\), i.e. that \(K\) is positive semi-definite.

Consider what that would mean:

\begin{align} z^{T} K z &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} K_{ij} z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x^{(j)}) z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n}z_{i} \sum_{l}^{} \qty(\phi_{l}\qty(x^{(i)}) \phi_{l}\qty(x^{(j)})) z_{j} \\ &= \sum_{l}^{} \qty(\sum_{i}^{}z_{i}\phi_{l}\qty(x^{(i)}))^{2} \geq 0 \end{align}

This shows that \(K\) being PSD is a necessary condition for \(k\) to be a valid kernel. What about sufficiency? That direction we take on faith (proof by divine intervention):

\(k\) is a valid kernel IFF for all \(n < \infty\) and all choices of \(x^{(1)}, \dots, x^{(n)}\), the corresponding kernel matrix \(K\) is PSD.
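
A minimal empirical check of this condition on a finite sample (NumPy assumed; the tolerance guards against floating-point error; names are illustrative):

```python
import numpy as np

def is_plausible_kernel(kernel, X, tol=1e-8):
    """Build the kernel matrix on the sample X and check it is (numerically) PSD."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    eigvals = np.linalg.eigvalsh(K)  # valid kernels give symmetric K, so eigvalsh applies
    return bool(np.all(eigvals >= -tol))
```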

deriving the kernel trick

Consider now the update rule for \(\beta\) from above:

\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \end{equation}

plugging in the definition of \(\theta\) in terms of \(\beta\) above, we obtain:

\begin{align} \beta_{i} &= \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \qty(\sum_{j=1}^{n} \beta_{j} \phi\qty(x^{(j)}))^{T}\phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \sum_{j=1}^{n} \beta_{j}\left\langle \phi \qty(x^{(j)}), \phi\qty(x^{(i)}) \right\rangle) \end{align}

This gives us two observations:

  1. \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\), \(\forall i,j\) can be precomputed
  2. often computing these inner products can be much faster than \(O\qty(p)\)

efficiently computing the kernel \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\)

Consider the following scheme to feature engineer your data:

\begin{equation} \phi \qty(x) = \mqty[1 \\ x_1 \\ \dots \\ x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \dots \\ x_1x_1x_1 \\ x_1x_1x_2 \\ \dots \\ x_{d} x_{d} x_{d}] \end{equation}

How would we compute the inner product above? Naively, for two data samples \(x,z\):

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}+ \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k} \end{equation}

You will notice:

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}}_{\langle x,z \rangle^{2}}+ \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k}}_{\langle x,z \rangle^{3}} \end{equation}

So we can finally write:

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \langle x,z \rangle + \langle x,z \rangle^{2} + \langle x,z \rangle^{3} \end{equation}

which takes only \(O\qty(d)\) time to evaluate. As opposed to the naive \(O\qty(np)\) cost per iteration of linear regression in the feature space, we get \(O\qty(n^{2})\) per iteration plus a one-time \(O\qty(n^{2}d)\) precomputation of the inner products above. This is nice because \(p\) may be much larger than \(n\).
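
A sanity-check sketch of this identity for a small \(d\) (NumPy assumed; names and sample values are illustrative):

```python
import numpy as np
from itertools import product

def phi_monomials(x):
    """Explicit feature map: 1, all x_i, all x_i x_j, all x_i x_j x_k (O(d^3) entries)."""
    d = len(x)
    feats = [1.0]
    for degree in (1, 2, 3):
        feats += [np.prod([x[i] for i in idx]) for idx in product(range(d), repeat=degree)]
    return np.array(feats)

x, z = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.25, -0.5])
lhs = phi_monomials(x) @ phi_monomials(z)
rhs = 1 + x @ z + (x @ z) ** 2 + (x @ z) ** 3
assert np.isclose(lhs, rhs)  # O(d) kernel evaluation matches the O(d^3) inner product
```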

motivation

Suppose you have some kind of non-linear (say, exponential) dataset… Consider now Linear Regression with a feature map, whose batch gradient update is:

\begin{equation} \theta := \theta + \alpha \sum_{i=1}^{n} \qty(y^{(i)} - \theta^{T}\phi\qty(x^{(i)})) \phi \qty(x^{(i)}) \end{equation}

NOTICE that \(\theta\) (assuming it starts at \(0\)) remains a linear combination of the \(\phi\qty(x^{(i)})\)s, which we can write as:

\begin{equation} \theta = \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)}) \end{equation}

observe: this is \(n\) variables \(\beta_{i}\) instead of \(p\) variables. This is nice because if \(p\) were large, working directly with the feature map \(\phi\) could get really expensive.

Substituting the above in for \(\theta\), we obtain:

\begin{equation} \theta = \sum_{i=1}^{n} \underbrace{\qty(\beta_{i} + \alpha\qty(y^{(i)}-\theta^{T}\phi\qty(x^{(i)})))}_{\text{new $\beta_i$}} \phi\qty(x^{(i)}) \end{equation}

i.e. each gradient step updates the coefficients \(\beta_{i}\) directly, which is exactly the update rule derived above.
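
A small numerical check of this claim (NumPy assumed; names are illustrative): running the gradient update on \(\theta\) directly, and running the \(\beta\) update while reconstructing \(\theta = \sum_i \beta_i \phi\qty(x^{(i)})\), should track the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))  # rows are phi(x_i); n = 5 samples, p = 3 features
y = rng.normal(size=5)
alpha = 0.01

theta = np.zeros(3)
beta = np.zeros(5)
for _ in range(500):
    resid = y - Phi @ theta                # y_i - theta^T phi(x_i)
    theta = theta + alpha * Phi.T @ resid  # batch gradient step on theta
    beta = beta + alpha * resid            # same step expressed in the beta coordinates

assert np.allclose(theta, Phi.T @ beta)    # theta stays equal to sum_i beta_i phi(x_i)
```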