Posts

ACL2025 Keynote: Luke Zettlemoyer

Last edited: August 8, 2025

Naively: “almost everything comes from pretraining.” The question: how much can simple supervision radically change the behavior of our language models?

Key Directions

  1. data long-tail: tokenizer-free LLMs
  2. data modules: how do we specialize quickly?

Tokenizer-Free LM

Byte-level LMs are just more expensive (i.e., there are just many more residual streams, and that’s pretty bad). High-level intuition: take the input bytes, create some “strides”/“patches”, send the patches through a transformer, and then unpatch.
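A minimal sketch of that patch/unpatch flow, assuming fixed-size patches and a vanilla encoder stack; the class name and sizes are illustrative, and the systems discussed in the talk use learned/dynamic patch boundaries and causal attention rather than this toy setup.

```python
# Toy byte-level LM with fixed-size patches (illustrative only).
import torch
import torch.nn as nn

class BytePatchLM(nn.Module):
    def __init__(self, patch_size=4, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(256, d_model)                 # one embedding per byte value
        self.patch_proj = nn.Linear(patch_size * d_model, d_model)   # "patch": pool a group of bytes into one vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)    # runs over patches, not raw bytes
        self.unpatch = nn.Linear(d_model, patch_size * 256)          # "unpatch": per-byte logits for each patch

    def forward(self, byte_ids):                                     # byte_ids: (batch, seq_len)
        b, t = byte_ids.shape
        x = self.byte_embed(byte_ids)                                # (b, t, d)
        x = x.view(b, t // self.patch_size, -1)                      # group bytes into patches
        x = self.patch_proj(x)                                       # (b, num_patches, d)
        x = self.transformer(x)                                      # far fewer positions than raw bytes
        return self.unpatch(x).view(b, t, 256)                       # back to per-byte predictions

ids = torch.randint(0, 256, (2, 32))                                 # raw UTF-8 bytes in
print(BytePatchLM()(ids).shape)                                      # torch.Size([2, 32, 256])
```

The point of the sketch: the expensive transformer only sees `seq_len / patch_size` positions, which is where the cost savings over a plain byte-level model come from.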

ACL2025 Li: TokAlign Token Alignment

Last edited: August 8, 2025

Method to adapt tokenization across models.

Notable Methods

  1. use pairwise cosine similarity between token embeddings to create an alignment grid between the two vocabularies
  2. initialize the new adapted embedding for each ID from its most similar token (rough sketch after this list)
  3. tune
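A rough sketch of steps 1–2, assuming we already have token-embedding matrices for the source and target vocabularies that live in a shared space; function and variable names are illustrative, and TokAlign’s actual alignment and tuning procedure involves more than this.

```python
# Sketch: cosine-similarity alignment grid + nearest-neighbor initialization.
import torch
import torch.nn.functional as F

def align_and_init(src_emb, tgt_emb):
    """src_emb: (V_src, d) trained embeddings of the original vocabulary.
    tgt_emb: (V_tgt, d) embeddings for the new vocabulary in the same space."""
    # Step 1: pairwise cosine similarity grid between the two vocabularies.
    sim = F.normalize(tgt_emb, dim=-1) @ F.normalize(src_emb, dim=-1).T   # (V_tgt, V_src)
    # Step 2: initialize each new token from its most similar source token.
    best_src = sim.argmax(dim=-1)              # (V_tgt,) index of the closest source token
    return src_emb[best_src].clone()           # step 3 (tuning) would fine-tune these

src = torch.randn(32000, 768)   # e.g. the original model's embedding table
tgt = torch.randn(50000, 768)   # embeddings for the new tokenizer's vocabulary
print(align_and_init(src, tgt).shape)          # torch.Size([50000, 768])
```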

ACL2025 Monday Morning Posters

Last edited: August 8, 2025

ACL2025 Zhang: FaithfulRAG: Fact-level conflict modeling

Key insight: RAG performance degrades when the model’s retrieved context and parametric knowledge mismatch; identify those conflicts and use a three-step iterative method to improve context faithfulness.

ACL2025 Ding: LLM reasoning capability via scalable question synthesis

Key insight: generate free-form questions conditioned only on the BOS token (sketch below), then distill and apply DPO to get a good question-generation dataset, and fine-tune on it directly.
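A minimal sketch of the BOS-only sampling step, assuming a Hugging Face causal LM; the checkpoint name and sampling settings are illustrative, and the distill/DPO stages that turn these samples into a question dataset are not shown.

```python
# Sample text conditioned only on the BOS token (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; the paper uses stronger open models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# The "prompt" is only BOS, so the model free-associates from its pretraining
# distribution instead of rephrasing a seed question.
bos = torch.tensor([[tok.bos_token_id]])
out = model.generate(bos, do_sample=True, temperature=1.0, top_p=0.95, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```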

ACL2025 Wen: synthetic data strategy for domain-specific retrieval

Key insight: train your models enough to memorize the content of a specific domain, in particular using document-based IDs, so that they can recall it better (sketch below).
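A hypothetical sketch of the kind of (passage → document ID) training pairs this memorization relies on; the corpus, chunking, and ID scheme here are illustrative, not the paper’s recipe.

```python
# Hypothetical (passage -> document ID) pairs for memorization-style retrieval.
corpus = {
    "doc_0042": "Aspirin irreversibly inhibits COX-1. It reduces platelet aggregation.",
    "doc_0417": "Warfarin is metabolized primarily by CYP2C9. Its dose needs monitoring.",
}

def make_examples(corpus):
    examples = []
    for doc_id, text in corpus.items():
        for chunk in text.split(". "):                    # naive chunking, for the sketch
            prompt = f"Passage: {chunk}\nDocument ID:"    # model is trained to emit the ID
            examples.append({"prompt": prompt, "target": f" {doc_id}"})
    return examples

for ex in make_examples(corpus):
    print(ex["prompt"], "->", ex["target"])
```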

ACL2025 Orals: Efficient NLP

Last edited: August 8, 2025