Supervised learning (also known as behavioral cloning when the agent is learning what to do in an observe-act cycle) is a type of decision making method.
constituents
- input space: \(\mathcal{X}\)
- output space: \(\mathcal{Y}\)
- hypothesis/model/prediction: \(h : \mathcal{X} \to \mathcal{Y}\)
requirements
Our ultimate goal is to learn a good model \(h\) from the training set:
- what “good” means is hard to define
- we generally want to use the model on new data, not just the training set
a continuous \(\mathcal{Y}\) makes this a regression problem; a discrete \(\mathcal{Y}\) makes it a classification problem.
That is, we want our hypothesis to behave as \(h_{\theta}\qty (x^{(i)}) \approx y^{(i)}\).
additional information
training set
The training set is a set of pairs:
\begin{equation} \qty {\qty(x^{(1)}, y^{(1)}) \dots \qty (x^{(n)}, y^{(n)})} \end{equation}
such that \(x^{(j)} \in \mathcal{X}, y^{(j)} \in \mathcal{Y}\).
We call \(n\) the training set size.
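As a minimal sketch (with a made-up training set and a linear hypothesis class, using numpy), we can fit \(h_{\theta}\) by least squares and check that \(h_{\theta}\qty(x^{(i)}) \approx y^{(i)}\) on the training set:

```python
import numpy as np

# Hypothetical training set: pairs (x^(i), y^(i)) with X = R^2, Y = R.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])  # inputs x^(i)
y = np.array([1.0, 2.0, 3.0, 5.0])                              # labels y^(i)
n = len(X)  # training set size

# Fit a linear hypothesis h_theta(x) = theta . (x, 1) by least squares.
A = np.hstack([X, np.ones((n, 1))])           # append a bias column
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

predictions = A @ theta
# We want h_theta(x^(i)) ≈ y^(i) on the training set.
print(np.allclose(predictions, y, atol=1e-6))
```

Note that fitting the training set well says nothing by itself about performance on new data, which is the point of the second bullet above.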
main procedure
- provide the agent with some examples
- use an automated learning algorithm to generalize from the examples
This works well for typically representative situations, but if you throw the agent into a completely unfamiliar situation, supervised learning cannot be expected to perform well.
disadvantages
- the labeled data is finite
- limited by the quality of the behavior demonstrated in the training data
- the model can only interpolate between the finitely many training states; it does not extrapolate well beyond them
cost function
see cost function
evaluation
see machine learning evaluation
linear classification is convex
We can separate two sets of points \(\qty {x_1 \dots x_{N}}, \qty {y_1 \dots y_{M}} \subset \mathbb{R}^{n}\) by a hyperplane. That is, we can:
find \(a \in \mathbb{R}^{n}, b \in \mathbb{R}\) with \(a^{T}x_{i} + b > 0, i = 1 \dots N\), \(a^{T}y_{i} + b < 0, i = 1 \dots M\).
These strict inequalities are homogeneous in \(a,b\), so we can rescale \(a,b\) to make them non-strict:
find \(a \in \mathbb{R}^{n}, b \in \mathbb{R}\) with \(a^{T}x_{i} + b \geq 1, i = 1 \dots N\), \(a^{T}y_{i} + b \leq -1, i = 1 \dots M\).
thus we have obtained an LP feasibility problem.
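This feasibility problem can be sketched with `scipy.optimize.linprog` using a zero objective (the point clouds `xs`, `ys` here are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Two hypothetical 2-D point clouds; we check linear separability as an
# LP feasibility problem in the variables z = (a, b).
xs = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0]])   # want a^T x_i + b >= 1
ys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # want a^T y_i + b <= -1

d = xs.shape[1]
# Rewrite in linprog's A_ub @ z <= b_ub form:
#   -(x_i, 1) . z <= -1   and   (y_i, 1) . z <= -1
A_ub = np.vstack([
    -np.hstack([xs, np.ones((len(xs), 1))]),
    np.hstack([ys, np.ones((len(ys), 1))]),
])
b_ub = -np.ones(len(xs) + len(ys))

res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (d + 1))
print("separable:", res.success)  # a feasible z = (a, b) certifies separability
```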
robust linear discrimination
Maybe you want your hyperplanes to be separated by some distance:
\begin{align} \mathcal{H}_{1} = \qty {z \mid a^{T}z + b = 1} \end{align}
\begin{align} \mathcal{H}_{2} = \qty {z \mid a^{T}z + b = -1} \end{align}
the distance between them is then \(\frac{2}{\norm{a}_{2}}\); hence to separate points by maximum margin:
\begin{align} \min_{a,b}\quad & \qty(\frac{1}{2})\norm{a}_{2}^{2} \\ \textrm{s.t.} \quad & a^{T}x_{i} + b \geq 1, i = 1 \dots N \\ & a^{T}y_{i} + b \leq -1, i = 1 \dots M \end{align}
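This QP can be sketched with a general-purpose solver such as `scipy.optimize.minimize` (SLSQP); in practice a dedicated QP or SVM solver would be the usual choice. The point clouds are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-margin separation: minimize (1/2) ||a||^2 subject to the
# margin constraints, with variables z = (a, b).
xs = np.array([[2.0, 2.0], [3.0, 3.0]])
ys = np.array([[0.0, 0.0], [1.0, 0.0]])
d = xs.shape[1]

def objective(z):                 # (1/2) ||a||_2^2
    a = z[:d]
    return 0.5 * a @ a

# scipy's "ineq" convention is fun(z) >= 0.
constraints = (
    [{"type": "ineq", "fun": lambda z, x=x: z[:d] @ x + z[d] - 1} for x in xs]
    + [{"type": "ineq", "fun": lambda z, y=y: -(z[:d] @ y + z[d]) - 1} for y in ys]
)

res = minimize(objective, x0=np.zeros(d + 1),
               constraints=constraints, method="SLSQP")
a, b = res.x[:d], res.x[d]
print("margin width:", 2 / np.linalg.norm(a))  # the 2 / ||a||_2 from above
```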
approximate linear separation
“minimize misclassification”
\begin{align} \min_{u,v,a,b}\quad & 1^{T}u + 1^{T}v \\ \textrm{s.t.} \quad & a^{T} x_{i} + b \geq 1 - u_{i} \\ & a^{T}y_{i} + b \leq -1 + v_{i} \\ & u \geq 0 \\ & v \geq 0 \end{align}
interpretation: \(u_{i}\) (and likewise \(v_{i}\)) is the “extra slack” you are giving each point, and we want as little of it in total as possible.
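The slack-minimizing LP can likewise be sketched with `scipy.optimize.linprog` (made-up overlapping point clouds; a point shared by both classes forces nonzero slack):

```python
import numpy as np
from scipy.optimize import linprog

# Variables z = (a, b, u, v); minimize the total slack 1^T u + 1^T v.
xs = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0]])  # (0, 0) also appears in ys
ys = np.array([[0.0, 0.0], [1.0, 0.0]])              # so the classes overlap
N, M, d = len(xs), len(ys), xs.shape[1]

# Objective: 0 on (a, b), 1 on each u_i and v_i.
c = np.concatenate([np.zeros(d + 1), np.ones(N + M)])

# a^T x_i + b >= 1 - u_i   ->  -(x_i, 1) . (a, b) - u_i <= -1
A_x = np.hstack([-xs, -np.ones((N, 1)), -np.eye(N), np.zeros((N, M))])
# a^T y_i + b <= -1 + v_i  ->   (y_i, 1) . (a, b) - v_i <= -1
A_y = np.hstack([ys, np.ones((M, 1)), np.zeros((M, N)), -np.eye(M)])

res = linprog(c, A_ub=np.vstack([A_x, A_y]), b_ub=-np.ones(N + M),
              bounds=[(None, None)] * (d + 1) + [(0, None)] * (N + M))
print("total slack:", res.fun)  # the shared point alone forces u + v >= 2
```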
