Posts

ACL2025 Pagoni: Patches Scale Better Than Tokens

Last edited: August 8, 2025

One-Liner

“Patches (dynamically sized groups of bytes) scale better than tokens”

Motivation / Novelty

  • typical byte-level LMs are very expensive, because operating on raw bytes means many more tokens per sequence
  • it’s hard to push tokenizers beyond 4–6 bytes per token: by Zipf’s law, ever-larger vocabularies buy diminishing coverage
  • so, group bytes into patches and model those instead

Notable Methods

token patch

“how do we segment the byte sequence into patches?” — insight: group predictable bytes after every hard choice! i.e., once you train a small byte-level model, there are “obvious” low-entropy bytes that can be folded into the current patch, and a new patch starts wherever next-byte entropy spikes.
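
Here’s a minimal sketch of that entropy-based patching idea, assuming a small byte-level LM has already produced per-byte next-byte entropies; the function name and threshold value are illustrative, not the paper’s implementation:

```python
def entropy_patches(byte_seq, entropies, threshold=2.0):
    """Segment bytes into patches: open a new patch whenever the
    byte-LM's next-byte entropy exceeds `threshold` (a "hard choice"),
    and fold low-entropy ("obvious") bytes into the current patch.
    `entropies[i]` is the entropy of predicting byte i; the threshold
    here is made up for illustration."""
    patches, current = [], []
    for byte, h in zip(byte_seq, entropies):
        if current and h > threshold:  # hard choice -> patch boundary
            patches.append(bytes(current))
            current = []
        current.append(byte)
    if current:
        patches.append(bytes(current))
    return patches

# toy run with made-up entropies: predictable bytes get grouped
print(entropy_patches(b"the cat", [3.1, 0.4, 0.3, 2.8, 2.9, 0.5, 0.2]))
# -> [b'the', b' ', b'cat']
```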

ACL2025 Tuesday Afternoon Posters

Last edited: August 8, 2025

ACL2025 Wu: RankCoT refining knowledge for retrieval augmented generation through ranking CoT

Key insight: generate a bunch of chain-of-thoughts, including over irrelevant documents, re-rank them via self-reflection, then train with DPO
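
If I understood the pipeline right, it looks roughly like the sketch below; `llm.generate`, `llm.score`, and the DPO step are hypothetical placeholders, not the paper’s API:

```python
def rankcot_preference_pair(query, docs, llm, k=8):
    """Hypothetical sketch: sample k CoTs over ALL retrieved docs
    (relevant or not), re-rank via a self-reflection prompt, and turn
    best-vs-worst into a DPO preference pair."""
    prompt = f"Docs: {docs}\nQ: {query}\nReason step by step:"
    cots = [llm.generate(prompt) for _ in range(k)]
    # self-reflection: the model scores how well each CoT is grounded
    ranked = sorted(
        cots,
        key=lambda c: llm.score(f"Docs: {docs}\nQ: {query}\nCoT: {c}\n"
                                "Rate how faithful and correct this CoT is."),
        reverse=True,
    )
    # best vs. worst CoT becomes training data for DPO
    return {"prompt": query, "chosen": ranked[0], "rejected": ranked[-1]}
```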

ACL2025 Trienes: behavioral analysis of information salience

Key insight: you can ask models for summaries at shorter and shorter lengths; what survives the tightest budgets reveals which information the model considers salient
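
A minimal sketch of that probe, with hypothetical `summarize` and `contains_unit` helpers standing in for the actual summarizer and content-unit matcher:

```python
def salience_scores(document, content_units, summarize, contains_unit,
                    budgets=(200, 100, 50, 20)):
    """Summarize at progressively tighter word budgets; a content unit
    that survives into shorter summaries counts as more salient."""
    scores = {unit: 0 for unit in content_units}
    for max_words in budgets:
        summary = summarize(document, max_words=max_words)
        for unit in content_units:
            if contains_unit(summary, unit):
                scores[unit] += 1  # survived another, tighter budget
    return scores  # higher = retained at shorter lengths = more salient
```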

ACL2025 Abbes: small encoders can rival large encoders in detecting groundedness

Key insight: apparently groundedness classification doesn’t require that many parameters

ACL2025 Tuesday Morning Posters

Last edited: August 8, 2025

ACL2025 Katz: segment-based attention masking

Key insight: allow bidirectional attention within segments (e.g., the prompt) instead of strictly causal masking everywhere
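
My reconstruction of what such a mask could look like (the paper’s exact scheme may differ): tokens in designated segments, e.g. the prompt, attend bidirectionally within their own segment, while everything else stays causal.

```python
import torch

def segment_attention_mask(segment_ids, bidir_segments):
    """segment_ids: (n,) per-token segment labels.
    bidir_segments: segments (e.g., the prompt) that get bidirectional
    attention internally; attention is causal everywhere else.
    Returns a boolean (n, n) mask where True = attention allowed."""
    n = segment_ids.shape[0]
    q = torch.arange(n).unsqueeze(1)  # query positions
    k = torch.arange(n).unsqueeze(0)  # key positions
    causal = k <= q
    same_segment = segment_ids.unsqueeze(1) == segment_ids.unsqueeze(0)
    is_bidir = torch.isin(segment_ids, bidir_segments)
    bidir = same_segment & is_bidir.unsqueeze(1) & is_bidir.unsqueeze(0)
    return causal | bidir

# toy: a 3-token prompt (segment 0) then 2 generated tokens (segment 1)
print(segment_attention_mask(torch.tensor([0, 0, 0, 1, 1]),
                             bidir_segments=torch.tensor([0])))
```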

ACL2025 Monodorf: exploring modular structures in transformer-based language models

Key insight: learn circuit compositions by training a binary mask, optimizing jointly for faithfulness and sparsity
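
Schematically, the objective might look like the sketch below; the sigmoid relaxation, KL faithfulness term, and λ weight are my assumptions about how such a mask gets trained:

```python
import torch
import torch.nn.functional as F

num_components = 128  # e.g., number of maskable heads/edges (assumed)
mask_logits = torch.nn.Parameter(torch.zeros(num_components))

def circuit_loss(full_logits, masked_logits, lam=1e-3):
    """Faithfulness: the masked (circuit-only) model should match the
    full model's output distribution. Sparsity: keep few components on.
    `masked_logits` is assumed to come from a forward pass with
    sigmoid(mask_logits) gating the components."""
    faithfulness = F.kl_div(masked_logits.log_softmax(-1),
                            full_logits.softmax(-1),
                            reduction="batchmean")
    sparsity = torch.sigmoid(mask_logits).sum()
    return faithfulness + lam * sparsity
```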

ACL2025 Li: some samples matter more in next-token prediction

Key insight: samples with a large gap between the model’s generation probability and the ground truth are the ones where intervening causes the most dramatic effect
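
One way to operationalize that gap (my sketch, not necessarily the paper’s exact measure):

```python
import torch

def probability_gaps(logits, gt_token_ids):
    """logits: (batch, vocab) next-token logits; gt_token_ids: (batch,).
    Gap between the model's favorite token and the ground-truth token;
    high-gap samples are the candidates for intervention."""
    probs = logits.softmax(-1)
    p_gt = probs.gather(1, gt_token_ids.unsqueeze(1)).squeeze(1)
    p_top = probs.max(dim=-1).values
    return p_top - p_gt  # large gap = intervention should matter most
```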

ACL2025 Kim: counterfactual consistency prompting

Key insight: prompting with counterfactuals about temporal order makes the model more temporally consistent
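
An illustrative prompt pair (my wording, not the paper’s): ask the original question and its order-flipped counterfactual, then check the answers stay coherent.

```python
def counterfactual_temporal_prompts(event_a, event_b, question):
    """Hypothetical template: pair the original temporal premise with a
    counterfactual that flips the event order; a temporally consistent
    model should give answers that flip accordingly."""
    original = f"{event_a} happened before {event_b}. {question}"
    counterfactual = (f"Suppose instead that {event_b} happened before "
                      f"{event_a}. {question}")
    return original, counterfactual
```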

ACL2025 Workshop: Web Agents

Last edited: August 8, 2025