Kernel Trick

constituents

  • samples \(x,z \in \mathcal{D}\)
  • \(d\) input dimensions, \(p\) featured dimensions
  • feature map: \(\phi: \mathbb{R}^{d} \to \mathbb{R}^{p}\)
  • kernel: \(k\qty(x,z) = \langle\phi\qty(x), \phi\qty(z)\rangle\)

requirements

train

Compute \(k\qty(x^{(i)}, x^{(j)})\) for all \(i, j \in \{1, \dots, n\}\). Initialize \(\beta_{j} = 0\) for all \(j\). Then iterate:

\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty[y^{(i)} - \sum_{j=1}^{n} \beta_{j} k\qty(x^{(i)}, x^{(j)})] \end{equation}
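
A minimal sketch of this training loop in Python (NumPy assumed; the function and variable names here are illustrative, not from the source):

```python
import numpy as np

def train_kernelized(X, y, kernel, lr=0.01, iters=1000):
    """Kernelized update: beta_i += lr * (y_i - sum_j beta_j * K[i, j])."""
    n = X.shape[0]
    # Precompute the kernel (Gram) matrix once: K[i, j] = k(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            beta[i] += lr * (y[i] - K[i] @ beta)
    return beta
```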

test

Given \(x\), compute the prediction \(\theta^{T} \phi\qty(x)\), which expands as:

\begin{align} \theta^{T}\phi\qty(x) &= \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x) \\ &= \sum_{i=1}^{n} \beta_{i} k\qty(x^{(i)}, x) \end{align}

This takes \(O\qty(nd)\) time per test point, assuming each kernel evaluation costs \(O\qty(d)\).
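
A corresponding prediction sketch, under the same illustrative names as the training sketch above:

```python
def predict(x, X_train, beta, kernel):
    """theta^T phi(x) = sum_i beta_i * k(x_i, x), without ever forming phi."""
    return sum(b * kernel(xi, x) for b, xi in zip(beta, X_train))
```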

additional information

kernel smoothing

see kernel smoothing

some useful kernels

monomials of order up to \(t\)

we have:

\begin{equation} k\qty(x,z) = \qty(x^{T}z + c)^{t} \end{equation}

which corresponds to a feature map containing all monomials \(x_{i},\ x_{i}x_{j},\ x_{i}x_{j}x_{k}, \ldots\) up to degree \(t\), i.e. a \(\binom{d+t}{t}\)-dimensional feature space. For \(\qty(x^{T}z+c)^{2}\), for instance, we have:

\begin{equation} \phi\qty(x) = \mqty(c \\ \sqrt{2c}\, x_1 \\ \vdots \\ \sqrt{2c}\, x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_{d} x_{d}) \end{equation}

which you can work out by expanding (FOILing) the kernel and matching each resulting term to a product of feature-map entries.
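
A quick numerical sanity check of this identity for \(t = 2\) (NumPy assumed; the names and sample values are illustrative):

```python
import numpy as np

def phi_quadratic(x, c):
    """Explicit feature map for (x^T z + c)^2: [c, sqrt(2c) x_i, ..., x_i x_j, ...]."""
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(d)]
    return np.array([c] + list(np.sqrt(2 * c) * x) + cross)

x, z, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 0.5
lhs = phi_quadratic(x, c) @ phi_quadratic(z, c)
rhs = (x @ z + c) ** 2
assert np.isclose(lhs, rhs)  # both sides equal (x^T z + c)^2
```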

gaussian kernel

see gaussian kernel, which has:

\begin{equation} k\qty(x,z) = \exp \qty(\frac{- \norm{x-z}^{2}_{2}}{2\sigma^{2}}) \end{equation}
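
A minimal sketch of this kernel (NumPy assumed; \(\sigma\) is the bandwidth):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); corresponds to an
    infinite-dimensional feature map, yet costs only O(d) to evaluate."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```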

what makes a valid kernel?

This is a necessary and sufficient condition:

Suppose we have \(n\) data points \(x^{(1)}, \dots, x^{(n)}\). Consider the kernel matrix \(K \in \mathbb{R}^{n \times n}\), with \(K_{ij} = k\qty(x^{(i)}, x^{(j)})\). A valid kernel requires \(K \succeq 0\), i.e. that \(K\) is positive semi-definite.

Consider what that would mean:

\begin{align} z^{T} K z &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} K_{ij} z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} \phi\qty(x^{(i)})^{T} \phi\qty(x^{(j)}) z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n}z_{i} \sum_{l}^{} \qty(\phi_{l}\qty(x^{(i)}) \phi_{l}\qty(x^{(j)})) z_{j} \\ &= \sum_{l}^{} \qty(\sum_{i}^{}z_{i}\phi_{l}\qty(x^{(i)}))^{2} \geq 0 \end{align}

This shows that \(K\) being PSD is a necessary condition for \(k\) to be a valid kernel. What about sufficiency? That direction we take on faith (proof by divine intervention):

\(k\) is a valid kernel IFF for all \(n < \infty\) and all choices of \(x^{(1)}, \dots, x^{(n)}\), the corresponding kernel matrix \(K\) is PSD.
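
A minimal empirical check of this condition on a finite sample (NumPy assumed; the tolerance guards against floating-point error; names are illustrative):

```python
import numpy as np

def is_plausible_kernel(kernel, X, tol=1e-8):
    """Build the kernel matrix on the sample X and check it is (numerically) PSD."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    eigvals = np.linalg.eigvalsh(K)  # valid kernels give symmetric K, so eigvalsh applies
    return bool(np.all(eigvals >= -tol))
```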

deriving the kernel trick

Consider now the update rule for \(\beta\) from above:

\begin{equation} \beta_{i} = \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \end{equation}

plugging in the definition of \(\theta\) in terms of \(\beta\) above, we obtain:

\begin{align} \beta_{i} &= \beta_{i} + \alpha \qty(y^{(i)} - \theta^{T} \phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \qty(\sum_{j=1}^{n} \beta_{j} \phi\qty(x^{(j)}))^{T}\phi\qty(x^{(i)})) \\ &= \beta_{i} + \alpha \qty(y^{(i)} - \sum_{j=1}^{n} \beta_{j}\left\langle \phi \qty(x^{(j)}), \phi\qty(x^{(i)}) \right\rangle) \end{align}

This gives us two observations:

  1. \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\), \(\forall i,j\) can be precomputed
  2. often computing these inner products can be much faster than \(O\qty(p)\)

efficiently computing the kernel \(\langle \phi \qty(x^{(i)}), \phi \qty(x^{(j)}) \rangle\)

Consider the following scheme to feature engineer your data:

\begin{equation} \phi \qty(x) = \mqty[1 \\ x_1 \\ \dots \\ x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \dots \\ x_1x_1x_1 \\ x_1x_1x_2 \\ \dots \\ x_{d} x_{d} x_{d}] \end{equation}

How would we compute the inner product above? Naively, for two data samples \(x,z\):

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}+ \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k} \end{equation}

You will notice:

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}}_{\langle x,z \rangle^{2}}+ \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k}}_{\langle x,z \rangle^{3}} \end{equation}

So we can finally write:

\begin{equation} \langle \phi\qty(x), \phi \qty(z) \rangle = 1 + \langle x,z \rangle + \langle x,z \rangle^{2} + \langle x,z \rangle^{3} \end{equation}

which takes only \(O\qty(d)\) time to evaluate. As opposed to the naive \(O\qty(np)\) cost per iteration of linear regression in the feature space, we get \(O\qty(n^{2})\) per iteration plus a one-time \(O\qty(n^{2}d)\) precomputation of the inner products above. This is nice because \(p\) may be much larger than \(n\).
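
A sanity-check sketch of this identity for a small \(d\) (NumPy assumed; names and sample values are illustrative):

```python
import numpy as np
from itertools import product

def phi_monomials(x):
    """Explicit feature map: 1, all x_i, all x_i x_j, all x_i x_j x_k (O(d^3) entries)."""
    d = len(x)
    feats = [1.0]
    for degree in (1, 2, 3):
        feats += [np.prod([x[i] for i in idx]) for idx in product(range(d), repeat=degree)]
    return np.array(feats)

x, z = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.25, -0.5])
lhs = phi_monomials(x) @ phi_monomials(z)
rhs = 1 + x @ z + (x @ z) ** 2 + (x @ z) ** 3
assert np.isclose(lhs, rhs)  # O(d) kernel evaluation matches the O(d^3) inner product
```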

motivation

Suppose you have some kind of non-linear (say, exponential) dataset… Consider now Linear Regression with a feature map, whose batch gradient update is:

\begin{equation} \theta := \theta + \alpha \sum_{i=1}^{n} \qty(y^{(i)} - \theta^{T}\phi\qty(x^{(i)})) \phi \qty(x^{(i)}) \end{equation}

NOTICE that \(\theta\) (assuming it starts at \(0\)) remains a linear combination of the \(\phi\qty(x^{(i)})\)s, which we can write as:

\begin{equation} \theta = \sum_{i=1}^{n} \beta_{i} \phi\qty(x^{(i)}) \end{equation}

observe: this is \(n\) variables \(\beta_{i}\) instead of \(p\) variables. This is nice because if \(p\) were large, working directly with the feature map \(\phi\) could get really expensive.

Substituting the above in for \(\theta\), we obtain:

\begin{equation} \theta = \sum_{i=1}^{n} \underbrace{\qty(\beta_{i} + \alpha\qty(y^{(i)}-\theta^{T}\phi\qty(x^{(i)})))}_{\text{new $\beta_i$}} \phi\qty(x^{(i)}) \end{equation}

i.e. each gradient step updates the coefficients \(\beta_{i}\) directly, which is exactly the update rule derived above.
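
A small numerical check of this claim (NumPy assumed; names are illustrative): running the gradient update on \(\theta\) directly, and running the \(\beta\) update while reconstructing \(\theta = \sum_i \beta_i \phi\qty(x^{(i)})\), should track the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))  # rows are phi(x_i); n = 5 samples, p = 3 features
y = rng.normal(size=5)
alpha = 0.01

theta = np.zeros(3)
beta = np.zeros(5)
for _ in range(500):
    resid = y - Phi @ theta                # y_i - theta^T phi(x_i)
    theta = theta + alpha * Phi.T @ resid  # batch gradient step on theta
    beta = beta + alpha * resid            # same step expressed in the beta coordinates

assert np.allclose(theta, Phi.T @ beta)    # theta stays equal to sum_i beta_i phi(x_i)
```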