One-Liner
“Patches (dynamically grouped bytes) scale better than tokens”
Motivation / Novelty
- typical byte-level LMs are very expensive because the sequences are so long: one token per byte
- it's hard for tokenizers to go beyond ~4-6 bytes per token: by Zipf's Law, longer tokens get rapidly rarer, so squeezing out more compression blows up the vocabulary
- so, group the bytes into patches and model those instead
Notable Methods
byte patches (instead of tokens)
“how do we segment the byte sequence into patches?” — insight: group the predictable bytes after every hard choice! i.e., once you train a small byte-level model, there are “obvious” (low-entropy) next bytes and hard (high-entropy) ones, so open a new patch wherever the next-byte entropy is high — see the sketch below
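A minimal sketch of that boundary rule, assuming we already have per-position next-byte logits from a small byte LM; the function name and the threshold `theta` are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def entropy_patch_starts(next_byte_logits: torch.Tensor, theta: float) -> list[int]:
    """next_byte_logits[i]: logits over byte i given bytes < i, shape (seq_len, 256).
    Returns the start index of each patch: a new patch opens at every position
    whose byte was a "hard choice" (next-byte entropy above theta)."""
    probs = F.softmax(next_byte_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (seq_len,)
    starts = [0]
    for i in range(1, next_byte_logits.size(0)):
        if entropy[i] > theta:   # hard byte -> it starts a new patch;
            starts.append(i)     # the easy bytes after it get grouped with it
    return starts

# toy usage: random logits, just to show the shape of the call
logits = torch.randn(32, 256)
print(entropy_patch_starts(logits, theta=4.0))
```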
patcher and unpatcher cross-attend: bytes are pooled into patch representations on the way in, and patch representations are read back out at byte positions on the way out
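A rough sketch of that cross-attention, assuming one query per patch on the encode side and per-byte queries on the decode side; the module names, shapes, and the use of `nn.MultiheadAttention` are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Patcher(nn.Module):
    """Pools byte hidden states into one vector per patch via cross-attention."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patch_queries, byte_h, mask):
        # patch_queries: (B, n_patches, d) attend over byte_h: (B, n_bytes, d);
        # `mask` (n_patches, n_bytes), True = blocked, keeps each patch on its own byte span
        pooled, _ = self.attn(patch_queries, byte_h, byte_h, attn_mask=mask)
        return pooled

class Unpatcher(nn.Module):
    """Re-expands patch vectors back to byte positions via cross-attention."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_h, patch_h, mask):
        # byte-level queries (B, n_bytes, d) read from patch_h (B, n_patches, d)
        out, _ = self.attn(byte_h, patch_h, patch_h, attn_mask=mask)
        return out
```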