One-Liner
“Patches (dynamically grouped bytes) scale better than tokens”
Motivation / Novelty
- typical byte-level LMs are very expensive because the sequences are so long: one token per byte
- it's hard for tokenizers to go beyond ~4-6 bytes per token: by Zipf's Law, longer tokens get rapidly rarer, so squeezing out more compression blows up the vocabulary
- so, group the bytes into patches and model those instead
Notable Methods
byte patches (instead of tokens)
“how do we segment the byte sequence into patches?” — insight: group the predictable bytes after every hard choice! i.e., once you train a small byte-level model, there are “obvious” (low-entropy) next bytes and hard (high-entropy) ones, so open a new patch wherever the next-byte entropy is high — see the sketch below
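A minimal sketch of that boundary rule, assuming we already have per-position next-byte logits from a small byte LM; the function name and the threshold `theta` are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def entropy_patch_starts(next_byte_logits: torch.Tensor, theta: float) -> list[int]:
    """next_byte_logits[i]: logits over byte i given bytes < i, shape (seq_len, 256).
    Returns the start index of each patch: a new patch opens at every position
    whose byte was a "hard choice" (next-byte entropy above theta)."""
    probs = F.softmax(next_byte_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (seq_len,)
    starts = [0]
    for i in range(1, next_byte_logits.size(0)):
        if entropy[i] > theta:   # hard byte -> it starts a new patch;
            starts.append(i)     # the easy bytes after it get grouped with it
    return starts

# toy usage: random logits, just to show the shape of the call
logits = torch.randn(32, 256)
print(entropy_patch_starts(logits, theta=4.0))
```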
patcher and unpatcher cross-attend: bytes are pooled into patch representations on the way in, and patch representations are read back out at byte positions on the way out
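A rough sketch of that cross-attention, assuming one query per patch on the encode side and per-byte queries on the decode side; the module names, shapes, and the use of `nn.MultiheadAttention` are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Patcher(nn.Module):
    """Pools byte hidden states into one vector per patch via cross-attention."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patch_queries, byte_h, mask):
        # patch_queries: (B, n_patches, d) attend over byte_h: (B, n_bytes, d);
        # `mask` (n_patches, n_bytes), True = blocked, keeps each patch on its own byte span
        pooled, _ = self.attn(patch_queries, byte_h, byte_h, attn_mask=mask)
        return pooled

class Unpatcher(nn.Module):
    """Re-expands patch vectors back to byte positions via cross-attention."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_h, patch_h, mask):
        # byte-level queries (B, n_bytes, d) read from patch_h (B, n_patches, d)
        out, _ = self.attn(byte_h, patch_h, patch_h, attn_mask=mask)
        return out
```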