Houjun Liu

ACL2025 Pagnoni: Patches Scale Better Than Tokens

One-Liner

“Grouping bytes into dynamically-sized patches scales better than tokenizing into a fixed vocabulary”

Motivation / Novelty

  • typical byte-level LMs are very expensive because the sequence is much longer (one position per byte)
  • it’s hard for tokenizers to go beyond 4-6 bytes per token: Zipf’s Law means a larger vocabulary is dominated by rare tokens, so coverage per new token falls off quickly
  • so, instead of tokens, bytes are grouped into patches

Notable Methods

token patch

“how do we segment the byte sequence into patches?” — insight: group predictable bytes after every hard choice! i.e., once you train a small byte-level model, many next bytes are “obvious” (low entropy) and get absorbed into the current patch; a new patch starts whenever the next-byte entropy spikes
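A minimal sketch of this entropy-based segmentation idea (the function name, threshold value, and toy entropies are illustrative assumptions, not the paper’s exact procedure):

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def segment_into_patches(byte_seq, next_byte_entropies, threshold=2.0):
    """Start a new patch whenever the small byte-LM's next-byte
    entropy exceeds the threshold (a 'hard choice'); predictable
    (low-entropy) bytes are absorbed into the current patch."""
    patches, current = [], []
    for byte, h in zip(byte_seq, next_byte_entropies):
        if current and h > threshold:   # hard choice -> close patch
            patches.append(current)
            current = []
        current.append(byte)            # obvious byte -> absorb
    if current:
        patches.append(current)
    return patches
```

With toy entropies that spike at position 3, `segment_into_patches(list(b"abcdef"), [3.0, 0.5, 0.4, 2.5, 0.3, 0.2])` yields two patches, `abc` and `def`.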

the patcher (byte-to-patch encoder) and unpatcher (patch-to-byte decoder) use cross-attention between byte-level and patch-level representations

Key Figs

New Concepts

Notes