Houjun Liu

Byte-Pair Encoding

BPE is a common Subword Tokenization scheme.

Training

  1. choose the two symbols that are most frequently adjacent in the corpus
  2. merge those two symbols into one new symbol throughout the text
  3. repeat from step \(1\) until we have performed \(k\) merges
v = set(corpus.characters())                  # start with all single characters
for i in range(k):
    tl, tr = get_most_common_bigram(corpus)   # most frequent adjacent pair of symbols
    tnew = f"{tl}{tr}"                        # the merged symbol
    v.add(tnew)
    corpus.replace((tl, tr), tnew)            # merge every occurrence in the text
return v
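
For concreteness, here is a minimal runnable Python sketch of this training loop. The function name train_bpe, the representation of the corpus as a Counter of words (as symbol tuples), and the explicit end-of-word marker _ are assumptions made for illustration; the sketch also records the merges in order, which the Inference section below relies on.

from collections import Counter

def train_bpe(corpus, k):
    """Learn k BPE merges from a whitespace-separated corpus (minimal sketch)."""
    # Represent each word as a tuple of symbols, with "_" marking end of word.
    words = Counter(tuple(word) + ("_",) for word in corpus.split())
    vocab = {sym for word in words for sym in word}
    merges = []
    for _ in range(k):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (tl, tr), _count = pairs.most_common(1)[0]
        tnew = tl + tr
        merges.append((tl, tr))
        vocab.add(tnew)
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (tl, tr):
                    merged.append(tnew)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges

vocab, merges = train_bpe("pink fluffy unicorn dancing on rainbows", k=10)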

Most commonly, BPE is not run alone: it usually runs inside a space-separated (word-level) pre-tokenization system. Hence, after each word we usually append a special _ token which delineates the end of the word.

Hence: “pink fluffy unicorn dancing on rainbows” becomes

p i n k _ f l u f f y _ u n i c o r n _ d a n c i n g _ o n _ r a i n b o w s
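
As a sketch, this pre-tokenization might look like the following; the helper name pretokenize is hypothetical.

def pretokenize(sentence):
    # Split on whitespace, then spell out each word symbol by symbol,
    # appending "_" to mark the end of every word.
    return [sym for word in sentence.split() for sym in list(word) + ["_"]]

pretokenize("pink fluffy unicorn dancing on rainbows")
# ['p', 'i', 'n', 'k', '_', 'f', 'l', 'u', 'f', 'f', 'y', '_', ...]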

Inference

During inference time, we apply our stored merges in the order we learned them. That is, if we learned to merge e r into er first during training, we should apply that merge first during inference, before a later merge such as n er into ner.
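
A minimal sketch of this, assuming merges is the list of (left, right) pairs recorded in training order (for example, by the train_bpe sketch above):

def apply_bpe(word, merges):
    """Tokenize one word by replaying learned merges in training order (sketch)."""
    symbols = list(word) + ["_"]
    for tl, tr in merges:  # earliest-learned merge is applied first
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (tl, tr):
                merged.append(tl + tr)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

apply_bpe("unicorn", merges)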

Frequent subwords often end up being morphemes.