ACL2025 Keynote: Luke Zettlemoyer
Naively: “almost everything comes from pretraining.” How much can simple supervision radically change the behavior of our language model?
Key Directions
- data long-tail: tokenizer-free LLMs
- data modules: how do we specialize quickly?
Tokenizer-Free LM
Byte-level LMs are just more expensive (i.e., there are just a bunch more residual streams, and that’s pretty bad). High-level intuition: take the input bytes, create some “strides”/“patches”, send the patches through a transformer, and then unpatch.
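A minimal sketch of that patching idea, not the exact architecture from the talk: patch size, model dimensions, and the mean-pooling rule are all my assumptions for illustration.

```python
import torch
import torch.nn as nn

PATCH_SIZE = 4   # assumption: fixed-size patches; real systems often patch dynamically
D_MODEL = 256    # assumption: tiny dims for illustration

class BytePatchLM(nn.Module):
    """Toy byte-level LM: embed bytes, pool them into patches, run a transformer
    over the (much shorter) patch sequence, then unpatch back to per-byte logits."""
    def __init__(self):
        super().__init__()
        self.byte_embed = nn.Embedding(256, D_MODEL)           # one embedding per byte value
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.unpatch = nn.Linear(D_MODEL, PATCH_SIZE * 256)    # logits for every byte in a patch

    def forward(self, byte_ids):                               # (batch, seq_len), seq_len % PATCH_SIZE == 0
        b, n = byte_ids.shape
        x = self.byte_embed(byte_ids)                          # (b, n, d)
        # "patch": group consecutive bytes and mean-pool them, so the transformer
        # only sees n / PATCH_SIZE positions (far fewer residual streams)
        patches = x.view(b, n // PATCH_SIZE, PATCH_SIZE, D_MODEL).mean(dim=2)
        h = self.encoder(patches)                              # (b, n / PATCH_SIZE, d)
        # "unpatch": expand each patch representation back into per-byte logits
        return self.unpatch(h).view(b, n, 256)

model = BytePatchLM()
bytes_in = torch.randint(0, 256, (1, 16))
print(model(bytes_in).shape)  # torch.Size([1, 16, 256])
```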
ACL2025 Li: TokAlign Token Alignment
Method to adapt tokenization across models.
Notable Methods
- use pairwise cosine similarity between token embeddings to create an alignment grid
- initialize the new adapted embedding for each ID from its most similar token
- tune (see the sketch after this list)
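A rough sketch of the alignment-and-initialization step. Function and variable names are my own, not the paper’s, and it assumes both vocabularies have embeddings in a comparable space.

```python
import numpy as np

def cosine_alignment_grid(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between every source-vocab and target-vocab
    token embedding -> (|V_src|, |V_tgt|) alignment grid."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src @ tgt.T

def init_adapted_embeddings(grid: np.ndarray, model_emb: np.ndarray) -> np.ndarray:
    """For each target-vocab ID, copy the model embedding of its most similar
    source token as the initialization (to be tuned afterwards)."""
    best_src = grid.argmax(axis=0)      # (|V_tgt|,) index of the most similar source token
    return model_emb[best_src]          # (|V_tgt|, d_model)

# toy usage: 100-token source vocab, 80-token target vocab, 32-dim alignment embeddings
rng = np.random.default_rng(0)
src_emb, tgt_emb = rng.normal(size=(100, 32)), rng.normal(size=(80, 32))
model_emb = rng.normal(size=(100, 512))  # the LM's actual input embeddings
grid = cosine_alignment_grid(src_emb, tgt_emb)
new_emb = init_adapted_embeddings(grid, model_emb)
print(new_emb.shape)                     # (80, 512) -- then fine-tune
```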
ACL2025 Monday Morning Posters
ACL2025 Zhang: FaithfulRAG: Fact-level conflict modeling
Key insight: RAG performance degrades when the model’s retrieved context and parametric knowledge mismatch; identify those conflicts and use a three-step iterative method to improve context faithfulness.
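A minimal illustration, not the paper’s pipeline, of flagging a context/parametric-knowledge conflict: ask the same question with and without the retrieved context and compare the answers. `generate` is a hypothetical wrapper around whatever LLM API you use.

```python
def detect_conflict(generate, question: str, context: str) -> bool:
    """Return True when the answer grounded in the retrieved context
    disagrees with the model's parametric (context-free) answer."""
    parametric = generate(f"Answer briefly: {question}")
    contextual = generate(f"Context: {context}\nAnswer briefly using the context: {question}")
    return parametric.strip().lower() != contextual.strip().lower()

# Flagged question/context pairs would then go through the faithfulness-improvement
# steps; unflagged pairs can be answered directly.
```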
ACL2025 Ding: LLM reasoning capability via scalable question synthesis
Key insight: generate free-form questions conditioned only on the BOS token, then distill and DPO to get a nice question-generation dataset and directly fine-tune.
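A hedged sketch of the first step, sampling questions conditioned on nothing but the BOS token. The model choice and sampling settings are placeholders, not from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder generator for illustration; the paper's model is not in my notes
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# condition only on the BOS token and sample freely
bos_ids = torch.tensor([[tok.bos_token_id]])
out = model.generate(
    bos_ids,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    max_new_tokens=64,
    num_return_sequences=4,
    pad_token_id=tok.eos_token_id,
)
questions = [tok.decode(seq, skip_special_tokens=True) for seq in out]
# raw samples would then be filtered/distilled and preference-tuned (DPO)
# into a question-generation dataset used for fine-tuning
for q in questions:
    print(q)
```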
ACL2025 Wen: synthetic data strategy for domain-specific retrieval
Key insight: train your model enough to memorize the content of a specific domain so it can recall it better, in particular by using document-based IDs.
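A loose sketch of what such document-ID training data might look like; the format, field names, and example documents are my assumptions, not the paper’s.

```python
# Each document gets a stable ID; synthetic training pairs teach the model to
# (a) memorize document content and (b) emit the right doc ID for a query.
docs = {
    "doc-0001": "The warranty covers battery replacements within 24 months...",
    "doc-0002": "Firmware updates are distributed through the fleet portal...",
}

def make_training_examples(doc_id: str, text: str, synthetic_queries: list[str]) -> list[dict]:
    examples = []
    # memorization-style example: reproduce the document given its ID
    examples.append({"prompt": f"Recite document {doc_id}:", "completion": text})
    # retrieval-style examples: map a synthetic query to the document ID
    for q in synthetic_queries:
        examples.append({"prompt": f"Query: {q}\nRelevant document ID:", "completion": doc_id})
    return examples

data = make_training_examples(
    "doc-0001",
    docs["doc-0001"],
    ["How long are batteries covered under warranty?"],
)
for ex in data:
    print(ex)
```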