boosting

boosting combines many very simple models (weak learners) into a single, stronger ensemble.

additional information

example

for instance, we can ensemble four decision trees together: weight the prediction of each tree, add the weighted predictions up, and then apply a decision boundary to the sum (e.g. if the output is boolean, encode each prediction as \(\pm 1\), multiply it by that tree's weight, sum, and take the sign).
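A minimal sketch of this weighted-vote idea; the four stumps and their weights below are made up purely for illustration:

```python
# a minimal sketch: four hand-made "stumps" standing in for trained decision
# trees, each voting +1 or -1, combined by a weighted sum and a sign boundary
def ensemble_predict(trees, weights, x):
    score = sum(w * tree(x) for tree, w in zip(trees, weights))
    return 1 if score >= 0 else -1

# toy stand-ins for four decision trees (thresholds on single features)
trees = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.2 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
    lambda x: 1 if x[1] < 0.8 else -1,
]
weights = [0.8, 0.4, 0.6, 0.2]  # made-up per-tree weights

print(ensemble_predict(trees, weights, [0.7, 0.3]))  # -> 1
```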

high level ideas

  1. add more features (i.e. extract more, increase model complexity)
  2. add more weak learners together

AdaBoost

  1. initialize weights \(\alpha_{i} = \frac{1}{N}\)
  2. for \(t \in 1 … T\)
    • learn this round's classifier \(f_{t}\qty(x)\) on the data weighted by \(\alpha_{i}\)
    • recompute classifier coefficient \(\hat{w}_{t}\)
    • recompute weights \(\alpha_{i}\)
    • normalize weights \(\alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}\)
  3. final model predicts via \(\hat{y} = \text{sign}\qty(\sum_{t=1}^{T} \hat{w}_{t}f_{t}\qty(x))\) for classifier with \(T\) rounds
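
A compact sketch of this loop, assuming \(\pm 1\) labels and sklearn depth-1 decision trees ("stumps") as the weak learners; the function and variable names are just for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=10):
    N = len(y)
    alpha = np.full(N, 1.0 / N)                  # step 1: uniform data weights
    stumps, w_hat = [], []
    for _ in range(T):                           # step 2: T boosting rounds
        f_t = DecisionTreeClassifier(max_depth=1)
        f_t.fit(X, y, sample_weight=alpha)       # learn on weighted data
        pred = f_t.predict(X)
        err = np.sum(alpha * (pred != y)) / np.sum(alpha)  # weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)               # avoid log(0)
        w_t = 0.5 * np.log((1 - err) / err)      # classifier coefficient
        alpha *= np.exp(-w_t * y * pred)         # shrink correct, grow mistakes
        alpha /= alpha.sum()                     # normalize weights
        stumps.append(f_t)
        w_hat.append(w_t)
    return stumps, w_hat

def adaboost_predict(stumps, w_hat, X):
    # step 3: sign of the weighted sum of each round's predictions
    scores = sum(w * f.predict(X) for f, w in zip(stumps, w_hat))
    return np.sign(scores)
```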

classifier weighting

If \(f_{t}\qty(x)\) is good, then set \(\hat{w}_{t}\) to be large; otherwise, set \(\hat{w}_{t}\) to be small. We can write:

\begin{equation} \hat{w}_{t} = \frac{1}{2} \ln \qty( \frac{1- \text{weighted\_error}\qty(f_{t})}{\text{weighted\_error}\qty(f_{t})} ) \end{equation}

This expression has the following properties: a great classifier is upweighted, a random classifier gets a weight near zero, and a terrible classifier is automatically inverted (its negative weight flips its votes). Notice also that if the weighted error is \(0\), then \(\hat{w}_{t} \to \infty\), so a perfect classifier dominates the ensemble.

| Classifier Quality | Weight |
|---|---|
| good | positive, high |
| random | zero |
| bad (inversely good) | negative, high |
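
As a quick worked check of those three regimes, plugging weighted errors of \(0.1\), \(0.5\), and \(0.9\) into the formula above gives:

\begin{equation} \frac{1}{2}\ln\qty(\frac{1-0.1}{0.1}) \approx 1.10, \qquad \frac{1}{2}\ln\qty(\frac{1-0.5}{0.5}) = 0, \qquad \frac{1}{2}\ln\qty(\frac{1-0.9}{0.9}) \approx -1.10 \end{equation}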

sample weighting

\begin{equation} \alpha_{i} = \begin{cases} \alpha_{i} e^{-\hat{w}_{t}}, \text{if correct}\\ \alpha_{i} e^{\hat{w}_{t}}, \text{if mistake} \end{cases} \end{equation}

The intuition:

| Sample Correct | Classifier Quality | Weight |
|---|---|---|
| yes | good | decrease |
| yes | random | stays the same |
| yes | bad | increase |
| no | good | increase |
| no | random | stays the same |
| no | bad | decrease |

after every iteration, we normalize the weights so they sum to 1:

\begin{equation} \alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}} \end{equation}
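
As a concrete check, with \(\hat{w}_{t} \approx 1.10\) from the worked example above, a correctly classified point's weight shrinks by a factor of \(e^{-1.10} \approx 1/3\) while a misclassified point's weight grows by a factor of \(e^{1.10} \approx 3\), after which all \(\alpha_{i}\) are divided by their sum so they again total 1.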

weighting?

  • gradient descent: weight sample \(i\)'s contribution to the gradient by \(\alpha_{i}\)
  • decision trees: weight the classification error on point \(i\) by \(\alpha_{i}\)
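
A short sketch of both cases, assuming per-sample gradients are available as an array and using sklearn's `sample_weight` argument for trees; the helper names here are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# gradient descent: weight sample i's gradient contribution by alpha_i
def weighted_gradient(per_sample_grads, alpha):
    # per_sample_grads: (N, d) array of per-sample gradients, alpha: (N,) weights
    return np.sum(alpha[:, None] * per_sample_grads, axis=0)

# decision trees: weight the classification error on point i by alpha_i,
# which sklearn exposes via sample_weight
def weighted_tree(X, y, alpha):
    return DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
```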