ACL2025 Orals: QA
ACL2025 Pagnoni: Patches Scale Better Than Tokens
One-Liner
“Patches (dynamic groups of bytes) scale better than tokens”
Motivation / Novelty
- typical byte-level LMs are very expensive because sequences contain many more tokens (one per byte)
- it’s hard to push tokenizers beyond 4-6 bytes per token: by Zipf’s law, ever-rarer strings need an ever-larger vocabulary for diminishing compression gains
- so, group bytes into patches and model the patch sequence instead
Notable Methods
patching
“how do we segment the byte sequence into patches?” — insight: group predictable bytes after every hard choice! i.e., once you train a small byte-level model, most next bytes are “obvious” (low entropy); start a new patch at each hard, high-entropy byte and fold the predictable bytes that follow into it
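To make the rule concrete, a minimal sketch of entropy-based patching (my reconstruction, not the paper’s code): `next_byte_entropy` is a fake stand-in for the small byte-level LM, and the threshold is arbitrary.

```python
import random

def next_byte_entropy(prefix: bytes) -> float:
    # Stand-in for the small byte-level LM: a real implementation would
    # compute H = -sum(p * log p) over its 256-way next-byte distribution.
    random.seed(hash(prefix) & 0xFFFF)
    return random.uniform(0.0, 8.0)

def segment_into_patches(data: bytes, threshold: float = 6.0) -> list[bytes]:
    # Open a new patch whenever the next byte is a "hard choice"
    # (entropy above threshold); predictable bytes extend the current patch.
    patches: list[bytes] = []
    current = bytearray()
    for i in range(len(data)):
        if current and next_byte_entropy(data[:i]) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(data[i])
    if current:
        patches.append(bytes(current))
    return patches

print(segment_into_patches(b"patches scale better than tokens"))
```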
ACL2025 Tuesday Afternoon Posters
ACL2025 Wu: RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Key insight: generate a bunch of chain-of-thoughts, including over irrelevant documents, re-rank them via self-reflection, then train with DPO
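A rough sketch of that pipeline as I understood it; `generate_cot` and `reflect_score` are hypothetical stand-ins for LLM calls, not RankCoT’s actual interface:

```python
def rank_cot_training_pair(question, documents, generate_cot, reflect_score):
    # One chain-of-thought per retrieved document (relevant or not),
    # self-reflection scores to rank them, and the best/worst pair
    # becomes a (chosen, rejected) preference example for DPO.
    cots = [generate_cot(question, doc) for doc in documents]
    ranked = sorted(cots, key=reflect_score, reverse=True)
    return {"prompt": question, "chosen": ranked[0], "rejected": ranked[-1]}

# Toy usage with stand-in callables in place of real LLM calls:
pair = rank_cot_training_pair(
    "Who wrote Hamlet?",
    ["Shakespeare wrote Hamlet around 1600.", "Pizza originated in Naples."],
    generate_cot=lambda q, d: f"Given '{d}', reason about: {q}",
    reflect_score=lambda cot: float("Shakespeare" in cot),
)
print(pair)
```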
ACL2025 Trienes: Behavioral Analysis of Information Salience
Key insight: you can ask models for summaries at progressively shorter lengths; whatever survives the tightest length budget is what the model treats as the most salient information
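A toy sketch of the probing idea, assuming a `summarize` callable that stands in for the model:

```python
def salience_scores(sentences, summarize, budgets=(8, 4, 2, 1)):
    # Request summaries under shrinking length budgets; a sentence's
    # salience = how many budgets it survives into.
    scores = {s: 0 for s in sentences}
    for k in budgets:
        summary = summarize(sentences, max_sentences=k)  # hypothetical LLM call
        for s in sentences:
            if s in summary:
                scores[s] += 1
    return scores

# Toy usage: a fake summarizer that just keeps the first k sentences.
doc = ["The dam broke.", "Rescue teams arrived.", "It rained earlier."]
print(salience_scores(doc, lambda sents, max_sentences: sents[:max_sentences]))
```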
ACL2025 Abbes: Small Encoders Can Rival Large Decoders in Detecting Groundedness
Key insight: apparently groundedness classification (does the retrieved context actually support the claim?) doesn’t require that many parameters
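A minimal sketch of groundedness framed as NLI-style pair classification with a small encoder; the checkpoint name is illustrative, not the paper’s model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; any compact NLI cross-encoder should do.
NAME = "cross-encoder/nli-deberta-v3-small"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME)

def grounded(context: str, claim: str) -> bool:
    # Groundedness as entailment: does the context support the claim?
    inputs = tok(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(-1))]
    return label.lower() == "entailment"

print(grounded("The Eiffel Tower is in Paris.", "The tower is in France."))
```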
ACL2025 Tuesday Morning Posters
ACL2025 Katz: Segment-Based Attention Masking
Key insight: allow bidirectional attention within input segments (e.g., the prompt) instead of a fully causal mask
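A sketch of how such a mask might be built, assuming one prompt segment followed by causally generated tokens (my reading, not the paper’s code):

```python
import torch

def segment_attention_mask(prompt_len: int, total_len: int) -> torch.Tensor:
    # True = attention allowed. Generated tokens keep the usual causal
    # mask; tokens inside the prompt segment also see "future" prompt tokens.
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[:prompt_len, :prompt_len] = True  # bidirectional within the prompt
    return mask

print(segment_attention_mask(prompt_len=3, total_len=6).int())
```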
ACL2025 Mondorf: Exploring Modular Structures in Transformer-Based Language Models
Key insight: learn circuit compositions by optimizing a binary mask over model components for both faithfulness and sparsity
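A hedged sketch of what that objective could look like, with a relaxed (sigmoid) mask; the names and the KL-based faithfulness term are my assumptions:

```python
import torch
import torch.nn.functional as F

def circuit_mask_loss(masked_logits, full_logits, mask_logits, lam=0.1):
    # Faithfulness: the masked (circuit-only) model should reproduce the
    # full model's output distribution. Sparsity: keep as few components
    # as possible, via the expected fraction of mask entries switched on.
    faithfulness = F.kl_div(
        masked_logits.log_softmax(-1),
        full_logits.softmax(-1),
        reduction="batchmean",
    )
    sparsity = torch.sigmoid(mask_logits).mean()
    return faithfulness + lam * sparsity

# Toy usage with random tensors standing in for real model outputs.
loss = circuit_mask_loss(torch.randn(2, 10), torch.randn(2, 10), torch.randn(50))
print(loss)
```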
ACL2025 Li: Some More Samples of Next Token Prediction
Key insight: samples where the model’s generation probability diverges most from the ground truth are the ones where intervening causes the most dramatic effect
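A small sketch of one way to score that divergence per sample (my interpretation, not the paper’s exact metric):

```python
import torch

def sample_divergence(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Per-sample score: how far the model's generation probability falls
    # from the ground-truth token, averaged over the sequence. High scores
    # flag the samples where intervening should matter most.
    probs = logits.softmax(-1)                    # (batch, seq, vocab)
    p_target = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (1.0 - p_target).mean(dim=-1)          # (batch,)

scores = sample_divergence(torch.randn(4, 8, 100), torch.randint(0, 100, (4, 8)))
print(scores)  # higher = model and ground truth disagree more
```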
ACL2025 Kim: Counterfactual Consistency Prompting
Key insight: prompt with the counterfactual temporal order as well, and enforce agreement between the factual and counterfactual answers to make the model more temporally consistent
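A toy sketch of the prompting idea; the consistency check is my illustration, not the paper’s exact protocol:

```python
def counterfactual_prompt_pair(event_a: str, event_b: str) -> tuple[str, str]:
    # Pose the temporal-order question in both the factual and the
    # swapped (counterfactual) direction.
    return (
        f"Did {event_a} happen before {event_b}? Answer yes or no.",
        f"Did {event_b} happen before {event_a}? Answer yes or no.",
    )

def temporally_consistent(answer_factual: str, answer_counterfactual: str) -> bool:
    # A temporally consistent model must flip its answer when the
    # question's order is reversed.
    return answer_factual != answer_counterfactual

f, cf = counterfactual_prompt_pair("the moon landing", "the fall of the Berlin Wall")
print(f)
print(cf)
print(temporally_consistent("yes", "no"))  # stand-in answers from an LLM
```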