Supervised learning (also known as behavioral cloning when the agent is learning what to do in an observe-act cycle) is a type of decision making method.
constituents
- input space: \(\mathcal{X}\)
- output space: \(\mathcal{Y}\)
- hypothesis/model/prediction: \(h : \mathcal{X} \to \mathcal{Y}\)
requirements
Our ultimate goal is to learn a good model \(h\) from the training set:
- what “good” means is hard to define
- we generally want to use the model on new data, not just the training set
a continuous \(\mathcal{Y}\) makes this a regression problem; a discrete \(\mathcal{Y}\) makes it a classification problem.
That is, we want our hypothesis to behave as \(h_{\theta}\qty (x^{(i)}) \approx y^{(i)}\).
additional information
training set
The training set is a set of pairs:
\begin{equation} \qty {\qty(x^{(1)}, y^{(1)}) \dots \qty (x^{(n)}, y^{(n)})} \end{equation}
such that \(x^{(j)} \in \mathcal{X}, y^{(j)} \in \mathcal{Y}\).
We call \(n\) the training set size.
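As a minimal sketch (with a made-up training set and a linear hypothesis class, using numpy), we can fit \(h_{\theta}\) by least squares and check that \(h_{\theta}\qty(x^{(i)}) \approx y^{(i)}\) on the training set:

```python
import numpy as np

# Hypothetical training set: pairs (x^(i), y^(i)) with X = R^2, Y = R.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])  # inputs x^(i)
y = np.array([1.0, 2.0, 3.0, 5.0])                              # labels y^(i)
n = len(X)  # training set size

# Fit a linear hypothesis h_theta(x) = theta . (x, 1) by least squares.
A = np.hstack([X, np.ones((n, 1))])           # append a bias column
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

predictions = A @ theta
# We want h_theta(x^(i)) ≈ y^(i) on the training set.
print(np.allclose(predictions, y, atol=1e-6))
```

Note that fitting the training set well says nothing by itself about performance on new data, which is the point of the second bullet above.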
main procedure
- provide the agent with some examples
- use an automated learning algorithm to generalize from the examples
This works well for typically representative situations, but if you throw the agent into a completely unfamiliar situation, supervised learning cannot be expected to perform well.
disadvantages
- the labeled data is finite
- limited by the quality of the behavior demonstrated in the training data
- the model can only interpolate between the finitely many training states; it does not extrapolate well beyond them
cost function
see cost function
evaluation
see machine learning evaluation
linear classification is convex
We can separate two sets of points \(\qty {x_1 \dots x_{N}}, \qty {y_1 \dots y_{M}} \subset \mathbb{R}^{n}\) by a hyperplane. That is, we can:
find \(a \in \mathbb{R}^{n}, b \in \mathbb{R}\) with \(a^{T}x_{i} + b > 0, i = 1 \dots N\), \(a^{T}y_{i} + b < 0, i = 1 \dots M\).
These strict inequalities are homogeneous in \(a,b\), so we can rescale \(a,b\) to make them non-strict:
find \(a \in \mathbb{R}^{n}, b \in \mathbb{R}\) with \(a^{T}x_{i} + b \geq 1, i = 1 \dots N\), \(a^{T}y_{i} + b \leq -1, i = 1 \dots M\).
thus we have obtained an LP feasibility problem.
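This feasibility problem can be sketched with `scipy.optimize.linprog` using a zero objective (the point clouds `xs`, `ys` here are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Two hypothetical 2-D point clouds; we check linear separability as an
# LP feasibility problem in the variables z = (a, b).
xs = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0]])   # want a^T x_i + b >= 1
ys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # want a^T y_i + b <= -1

d = xs.shape[1]
# Rewrite in linprog's A_ub @ z <= b_ub form:
#   -(x_i, 1) . z <= -1   and   (y_i, 1) . z <= -1
A_ub = np.vstack([
    -np.hstack([xs, np.ones((len(xs), 1))]),
    np.hstack([ys, np.ones((len(ys), 1))]),
])
b_ub = -np.ones(len(xs) + len(ys))

res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (d + 1))
print("separable:", res.success)  # a feasible z = (a, b) certifies separability
```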
robust linear discrimination
Maybe you want your hyperplanes to be separated by some distance:
\begin{align} \mathcal{H}_{1} = \qty {z \mid a^{T}z + b = 1} \end{align}
\begin{align} \mathcal{H}_{2} = \qty {z \mid a^{T}z + b = -1} \end{align}
the distance between them is then \(\frac{2}{\norm{a}_{2}}\); hence to separate points by maximum margin:
\begin{align} \min_{a,b}\quad & \qty(\frac{1}{2})\norm{a}_{2}^{2} \\ \textrm{s.t.} \quad & a^{T}x_{i} + b \geq 1, i = 1 \dots N \\ & a^{T}y_{i} + b \leq -1, i = 1 \dots M \end{align}
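This QP can be sketched with a general-purpose solver such as `scipy.optimize.minimize` (SLSQP); in practice a dedicated QP or SVM solver would be the usual choice. The point clouds are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-margin separation: minimize (1/2) ||a||^2 subject to the
# margin constraints, with variables z = (a, b).
xs = np.array([[2.0, 2.0], [3.0, 3.0]])
ys = np.array([[0.0, 0.0], [1.0, 0.0]])
d = xs.shape[1]

def objective(z):                 # (1/2) ||a||_2^2
    a = z[:d]
    return 0.5 * a @ a

# scipy's "ineq" convention is fun(z) >= 0.
constraints = (
    [{"type": "ineq", "fun": lambda z, x=x: z[:d] @ x + z[d] - 1} for x in xs]
    + [{"type": "ineq", "fun": lambda z, y=y: -(z[:d] @ y + z[d]) - 1} for y in ys]
)

res = minimize(objective, x0=np.zeros(d + 1),
               constraints=constraints, method="SLSQP")
a, b = res.x[:d], res.x[d]
print("margin width:", 2 / np.linalg.norm(a))  # the 2 / ||a||_2 from above
```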
approximate linear separation
“minimize misclassification”
\begin{align} \min_{u,v,a,b}\quad & 1^{T}u + 1^{T}v \\ \textrm{s.t.} \quad & a^{T} x_{i} + b \geq 1 - u_{i} \\ & a^{T}y_{i} + b \leq -1 + v_{i} \\ & u \geq 0 \\ & v \geq 0 \end{align}
interpretation: \(u_{i}\) (and likewise \(v_{i}\)) is the “extra slack” you are giving each point, and we want as little of it in total as possible.
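The slack-minimizing LP can likewise be sketched with `scipy.optimize.linprog` (made-up overlapping point clouds; a point shared by both classes forces nonzero slack):

```python
import numpy as np
from scipy.optimize import linprog

# Variables z = (a, b, u, v); minimize the total slack 1^T u + 1^T v.
xs = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0]])  # (0, 0) also appears in ys
ys = np.array([[0.0, 0.0], [1.0, 0.0]])              # so the classes overlap
N, M, d = len(xs), len(ys), xs.shape[1]

# Objective: 0 on (a, b), 1 on each u_i and v_i.
c = np.concatenate([np.zeros(d + 1), np.ones(N + M)])

# a^T x_i + b >= 1 - u_i   ->  -(x_i, 1) . (a, b) - u_i <= -1
A_x = np.hstack([-xs, -np.ones((N, 1)), -np.eye(N), np.zeros((N, M))])
# a^T y_i + b <= -1 + v_i  ->   (y_i, 1) . (a, b) - v_i <= -1
A_y = np.hstack([ys, np.ones((M, 1)), np.zeros((M, N)), -np.eye(M)])

res = linprog(c, A_ub=np.vstack([A_x, A_y]), b_ub=-np.ones(N + M),
              bounds=[(None, None)] * (d + 1) + [(0, None)] * (N + M))
print("total slack:", res.fun)  # the shared point alone forces u + v >= 2
```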
