- initialize weights \(\alpha_{i} = \frac{1}{N}\)
- for \(t \in 1 … T\)
- learn this-round classifier \(f_{t}\qty(x)\) on data weights \(\alpha_{i}\)
- recompute classifier coefficient \(\hat{w}_{t}\)
- recompute weights \(\alpha_{i}\)
- normalize weights \(\alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}\)
- after \(T\) rounds, the final model predicts via \(\hat{y} = \text{sign}\qty(\sum_{t=1}^{T} \hat{w}_{t}f_{t}\qty(x))\) (see the sketch below)
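Putting the loop together: a minimal runnable sketch in Python, assuming decision stumps from scikit-learn as the weak learners and labels in \(\{-1, +1\}\) (the function names and stump choice are illustrative, not part of the notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T):
    """AdaBoost sketch; assumes y in {-1, +1} and 0 < weighted error < 1."""
    N = len(y)
    alpha = np.full(N, 1.0 / N)              # initialize weights alpha_i = 1/N
    stumps, w_hats = [], []
    for t in range(T):
        # learn this-round classifier f_t on data weights alpha_i
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
        pred = f_t.predict(X)
        eps = alpha[pred != y].sum() / alpha.sum()   # weighted error
        w_hat = 0.5 * np.log((1 - eps) / eps)        # classifier coefficient
        # recompute weights: shrink correct points, grow mistakes
        alpha = alpha * np.exp(-w_hat * y * pred)
        alpha = alpha / alpha.sum()                  # normalize to sum to 1
        stumps.append(f_t)
        w_hats.append(w_hat)
    return stumps, w_hats

def adaboost_predict(X, stumps, w_hats):
    # y_hat = sign(sum_t w_hat_t * f_t(x))
    score = sum(w * f.predict(X) for w, f in zip(w_hats, stumps))
    return np.sign(score)
```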
This algorithm only works on binary classification, and over the iterations it drives down the exponential loss (and with it the 0/1 loss that the exponential loss upper-bounds; see the convergence theorem below). Gradient boosting will eventually allow a more general form with other losses.
convergence theorem
If at every round \(t\) we can find a weak learner with weighted error strictly less than 0.5, then AdaBoost's training error converges to zero.
The training error of the classifier after training via AdaBoost (left term) is bounded by:
\begin{equation} \frac{1}{N} \sum_{i=1}^{N} 1 \qty {F\qty(x_{i}) \neq y_{i}} \leq \frac{1}{N} \sum_{i=1}^{N} \exp \qty(-y_{i}\text{score}\qty(x_{i})) = \prod_{t=1}^{T} \sum_{i=1}^{N} \alpha_{i,t} \exp \qty(-\hat{w}_{t} y_{i} f_{t}\qty(x_{i})) \end{equation}
where score is the unthresholded ensemble output, \(\text{score}\qty(x) = \sum_{t=1}^{T} \hat{w}_{t}f_{t}\qty(x)\), and \(F\qty(x) = \text{sign}\qty(\text{score}\qty(x))\).
Sketch:
Notice that \(y_{i}\,\text{score}\qty(x_{i})\) is a proxy for whether or not \(F\qty(x_{i}) \neq y_{i}\): the expression is negative exactly when they disagree.
And so \(e^{-y_{i}\text{score}\qty(x_{i})}\) is an exponential loss for the binary classification task: it grows exponentially when the score strongly disagrees with \(y_{i}\), and it is at least \(1\) whenever \(F\qty(x_{i}) \neq y_{i}\). This gives an upper bound on the zero-one loss on the left.
In order to make this error bound tight, we should therefore minimize the rightmost term as a function of \(\hat{w}_{t}\) (the \(\alpha_{i,t}\) are fixed by the previous rounds). Doing so gives the optimal \(\hat{w}_{t} = \frac{1}{2} \ln \qty(\frac{1-\varepsilon_{t}}{\varepsilon_{t}})\), where \(\varepsilon_{t}\) is the weighted error of \(f_{t}\), derived below.
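Filling in that minimization step (with each round's weights normalized so \(\sum_{i} \alpha_{i,t} = 1\)): split the round-\(t\) factor over correct and incorrect points,
\begin{equation} Z_{t} = \sum_{i=1}^{N} \alpha_{i,t} \exp \qty(-\hat{w}_{t} y_{i} f_{t}\qty(x_{i})) = \qty(1 - \varepsilon_{t}) e^{-\hat{w}_{t}} + \varepsilon_{t} e^{\hat{w}_{t}} \end{equation}
then setting \(\dv{Z_{t}}{\hat{w}_{t}} = 0\) gives \(\varepsilon_{t} e^{\hat{w}_{t}} = \qty(1 - \varepsilon_{t}) e^{-\hat{w}_{t}}\), which solves to exactly the \(\hat{w}_{t}\) above.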

classifier weighting
If \(f_{t}\qty(x)\) is good, then set \(\hat{w}_{t}\) to be large; otherwise, set \(\hat{w}_{t}\) to be small. We can write:
\begin{equation} \hat{w}_{t} = \frac{1}{2} \ln \qty( \frac{1- \text{weighted\_error}\qty(f_{t})}{\text{weighted\_error}\qty(f_{t})} ) \end{equation}
This expression has the following properties: a great classifier is upweighted, a random classifier (weighted error \(0.5\)) gets weight zero, and a terrible classifier is automatically inverted by its negative weight. Notice also that as the weighted error approaches \(0\), \(\hat{w}_{t} \to \infty\), which is nice.
| Classifier Quality | Weight |
|---|---|
| good | positive high |
| random | zero |
| bad (inversely good) | negative high |
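A quick numeric check of this table (the error values below are just illustrative):

```python
import numpy as np

def classifier_weight(eps):
    # w_hat = 1/2 * ln((1 - eps) / eps), where eps is the weighted error
    return 0.5 * np.log((1 - eps) / eps)

for eps in [0.05, 0.5, 0.95]:
    print(f"weighted error {eps:.2f} -> weight {classifier_weight(eps):+.2f}")
# weighted error 0.05 -> weight +1.47  (good: positive high)
# weighted error 0.50 -> weight +0.00  (random: zero)
# weighted error 0.95 -> weight -1.47  (bad: negative high, i.e. inverted)
```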
sample weighting
\begin{equation} \alpha_{i} = \begin{cases} \alpha_{i} e^{-\hat{w}_{t}}, \text{if correct}\\ \alpha_{i} e^{\hat{w}_{t}}, \text{if mistake} \end{cases} \end{equation}
The intuition:
| Sample Correct | Classifier Quality | Weight |
|---|---|---|
| yes | good | decrease |
| yes | random | stays the same |
| yes | bad | increase |
| no | good | increase |
| no | random | stays the same |
| no | bad | decrease |
after every iteration, we normalize the weights so they sum to 1: \(\alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}\).
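As a concrete sketch (names illustrative): since \(y_{i}, f_{t}\qty(x_{i}) \in \{-1, +1\}\), both cases collapse into one multiplicative update.

```python
import numpy as np

def update_sample_weights(alpha, w_hat, y, pred):
    # alpha_i *= exp(-w_hat) if correct, exp(+w_hat) if mistake;
    # equivalently alpha_i *= exp(-w_hat * y_i * f_t(x_i)) for labels in {-1, +1}
    alpha = alpha * np.exp(-w_hat * y * pred)
    return alpha / alpha.sum()  # normalize so the weights sum to 1
```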
how do weak learners use the weights?
- gradient descent: weight the contribution of point \(i\) to the gradient by \(\alpha_{i}\) (see the sketch after this list)
- decision trees: weight the classification error on point \(i\) by \(\alpha_{i}\)
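For instance, a weighted gradient for logistic regression might look like this (a sketch assuming labels in \(\{-1, +1\}\); the function name is made up for illustration):

```python
import numpy as np

def weighted_logistic_gradient(theta, X, y, alpha):
    # gradient of sum_i alpha_i * log(1 + exp(-y_i * x_i . theta)):
    # each point's contribution is scaled by its AdaBoost weight alpha_i
    margins = y * (X @ theta)
    per_point = -(y / (1 + np.exp(margins)))[:, None] * X
    return (alpha[:, None] * per_point).sum(axis=0)
```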
