boosting combines many very simple (weak) models into a single, much stronger model.
additional information
example
for instance, we can ensemble four decision trees together: weight the prediction of each tree, add the weighted predictions up, and then apply a decision boundary to the sum (e.g. if the output is boolean, encode each prediction as \(\pm 1\), multiply it by its weight, and take the sign of the total).
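A minimal sketch of this idea, assuming scikit-learn's `DecisionTreeClassifier` as the weak model; the per-tree weights here are hand-picked for illustration, not learned:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data with labels encoded as {-1, +1}
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1

# four shallow trees, each fit on a different bootstrap sample (illustrative choice)
rng = np.random.default_rng(0)
weights = [0.4, 0.3, 0.2, 0.1]  # hypothetical per-tree weights
trees = []
for _ in range(4):
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

# weighted sum of the +/-1 predictions, then a sign decision boundary
scores = sum(w * t.predict(X) for w, t in zip(weights, trees))
y_hat = np.sign(scores)
```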
high level ideas
- add more features (i.e. extract more, increase model complexity)
- add more weak learners together
AdaBoost
- initialize weights \(\alpha_{i} = \frac{1}{N}\)
- for \(t \in 1 … T\)
- learn this-round classifier \(f_{t}\qty(x)\) on data weights \(\alpha_{i}\)
- recompute classifier coefficient \(\hat{w}_{t}\)
- recompute weights \(\alpha_{i}\)
- normalize weights \(\alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}\)
- final model predicts via \(\hat{y} = \text{sign}\qty(\sum_{t=1}^{T} \hat{w}_{t}f_{t}\qty(x))\) for classifier with \(T\) rounds
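A minimal numpy/scikit-learn sketch of the loop above, with decision stumps as the weak learners; labels are assumed to be in \(\{-1, +1\}\), and the clipping of the weighted error is an added guard against division by zero, not part of the algorithm statement:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Minimal AdaBoost sketch; y must be in {-1, +1}."""
    N = len(X)
    alpha = np.full(N, 1.0 / N)               # initialize data weights alpha_i = 1/N
    learners, w_hats = [], []
    for t in range(T):
        # learn this round's classifier f_t on the weighted data
        f_t = DecisionTreeClassifier(max_depth=1)
        f_t.fit(X, y, sample_weight=alpha)
        pred = f_t.predict(X)
        # weighted error and classifier coefficient w_hat_t
        err = np.sum(alpha * (pred != y)) / np.sum(alpha)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0) / division by zero
        w_hat = 0.5 * np.log((1 - err) / err)
        # reweight samples: y * pred is +1 if correct, -1 on a mistake,
        # so this down-weights correct points and up-weights mistakes; then normalize
        alpha = alpha * np.exp(-w_hat * y * pred)
        alpha /= alpha.sum()
        learners.append(f_t)
        w_hats.append(w_hat)
    return learners, w_hats

def adaboost_predict(X, learners, w_hats):
    # final model: sign of the coefficient-weighted sum of weak predictions
    scores = sum(w * f.predict(X) for w, f in zip(w_hats, learners))
    return np.sign(scores)
```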
classifier weighting
If \(f_{t}\qty(x)\) is good, then set \(\hat{w}_{t}\) to be large; otherwise, set \(\hat{w}_{t}\) to be small. We can write:
\begin{equation} \hat{w}_{t} = \frac{1}{2} \ln \qty( \frac{1- \text{weighted\_error}\qty(f_{t})}{\text{weighted\_error}\qty(f_{t})} ) \end{equation}
This expression has the following properties: a great classifier is upweighted, a random classifier gets a weight near zero, and a terrible classifier is automatically inverted. Note also that if the weighted error is \(0\), \(\hat{w}_{t} \to \infty\), which is nice.
Classifier Quality | Weight |
---|---|
good | positive high |
random | zero |
bad (inversely good) | negative high |
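A quick numeric check of the table; the error values are arbitrary examples:

```python
import numpy as np

# classifier coefficient as a function of weighted error
w_hat = lambda err: 0.5 * np.log((1 - err) / err)

print(w_hat(0.05))  # good classifier   -> large positive weight (~+1.47)
print(w_hat(0.5))   # random classifier -> 0.0
print(w_hat(0.95))  # bad classifier    -> large negative weight (~-1.47)
```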
sample weighting
\begin{equation} \alpha_{i} = \begin{cases} \alpha_{i} e^{-\hat{w}_{t}}, \text{if correct}\\ \alpha_{i} e^{\hat{w}_{t}}, \text{if mistake} \end{cases} \end{equation}
The intuition:
Sample Correct? | Classifier Quality | Sample Weight \(\alpha_{i}\) |
---|---|---|
yes | good | decrease |
yes | random | stays the same |
yes | bad | increase |
no | good | increase |
no | random | stays the same |
no | bad | decrease |
after every iteration, we normalize so the weights sum to 1: \(\alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}\).
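A small sketch of this update; the `update_weights` helper is hypothetical, just to show the two cases plus the normalization:

```python
import numpy as np

def update_weights(alpha, w_hat, correct):
    """alpha_i <- alpha_i * exp(-w_hat) if correct, alpha_i * exp(+w_hat) on a mistake,
    then normalize so the weights sum to 1."""
    alpha = np.where(correct, alpha * np.exp(-w_hat), alpha * np.exp(w_hat))
    return alpha / alpha.sum()

# example: a good classifier (w_hat > 0) gets sample 2 wrong
alpha = np.array([0.25, 0.25, 0.25, 0.25])
correct = np.array([True, True, False, True])
print(update_weights(alpha, w_hat=1.0, correct=correct))  # the mistaken sample gains weight
```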
how do learners use the weights?
- gradient descent: weight the gradient contribution of point \(i\) by \(\alpha_{i}\)
- decision trees: weight the classification error on point \(i\) by \(\alpha_{i}\)
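Sketches of both cases; the helper names here are hypothetical, not from any particular library:

```python
import numpy as np

def weighted_error(y_true, y_pred, alpha):
    # decision trees: classification error where each point i counts alpha_i
    return np.sum(alpha * (y_true != y_pred)) / np.sum(alpha)

def weighted_gradient(per_example_grads, alpha):
    # gradient descent: scale each example's gradient contribution by alpha_i
    # before averaging (per_example_grads has shape [N, n_params])
    return np.average(per_example_grads, axis=0, weights=alpha)
```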