ICLR2025 Friday Posters
Last edited: August 8, 2025
ICLR2025 Morris: contextual document embeddings
Take a bunch of sentence embeddings as input to produce a new sentence embedding that is now contextual
ICLR2025 Noukhovitch: asynchronous reinforcement learning for language models
Rollout and tune concurrently
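A minimal sketch of what "concurrently" could look like, assuming a producer/consumer split: one thread generates rollouts with whatever policy version it last saw, another trains on them. The names `run_async`, `policy_version`, and the queue size are hypothetical and the "training step" is only a counter; this is not the paper's implementation. The point is that each batch may be slightly stale (off-policy) by the time it is trained on.

```python
import queue
import threading

def run_async(num_batches=5, queue_size=2):
    """Generate rollouts and 'train' concurrently; return (final version, staleness log)."""
    state = {"version": 0}              # stand-in for the policy weights
    lock = threading.Lock()
    batches = queue.Queue(maxsize=queue_size)
    staleness_log = []

    def generator():
        for i in range(num_batches):
            with lock:
                version_seen = state["version"]   # policy used for this rollout
            batches.put({"id": i, "policy_version": version_seen})
        batches.put(None)                         # sentinel: no more rollouts

    def learner():
        while True:
            batch = batches.get()
            if batch is None:
                break
            with lock:
                # How many updates happened since this batch was generated.
                staleness_log.append(state["version"] - batch["policy_version"])
                state["version"] += 1             # stand-in for one gradient step

    threads = [threading.Thread(target=generator), threading.Thread(target=learner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["version"], staleness_log
```

A staleness of 0 means the batch was on-policy when trained on; larger values mean the learner had already moved on while the rollout waited in the queue.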
ICLR2025 Yao: CR-CTC consistency regularization
CTC loss can be made more robust if you regularize for minimal difference between the outputs on two augmented views of the same mel spectrogram.
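A rough sketch of the regularizer, assuming the "difference" is measured as a symmetric KL divergence between the frame-wise posteriors of the two views (the notes do not say which divergence is used). The CTC losses are passed in as precomputed numbers, and `alpha` is a hypothetical weighting.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(view_a, view_b):
    """Mean symmetric KL over frames; each view is a list of per-frame posteriors."""
    assert len(view_a) == len(view_b)
    total = sum(kl(p, q) + kl(q, p) for p, q in zip(view_a, view_b))
    return total / (2 * len(view_a))

def total_loss(ctc_a, ctc_b, view_a, view_b, alpha=0.2):
    """Average CTC loss of the two views plus the weighted consistency term."""
    return 0.5 * (ctc_a + ctc_b) + alpha * consistency_loss(view_a, view_b)
```

When the two augmented views produce identical posteriors the extra term vanishes, so the model is only penalized where augmentation changes its predictions.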
ICLR2025 Sun: ReDeEP detecting hallucination using mechanistic interpretability
Find the layers most prone to inserting information, then use the logit lens to measure how much information is inserted by comparing the residual stream before and after the FFN. A strong change after a hallucination-prone FFN signals hallucination.
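A toy sketch of the before/after measurement, assuming the logit lens is a plain projection through the unembedding matrix and that the change is scored with Jensen-Shannon divergence. Both are assumptions for illustration; `ffn_shift` and the tiny matrices are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logit_lens(hidden, unembed):
    """Project a hidden state onto the vocabulary (one row of `unembed` per token)."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; bounded above by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ffn_shift(resid_before, ffn_output, unembed):
    """How much the FFN's contribution moves the logit-lens distribution."""
    resid_after = [r + f for r, f in zip(resid_before, ffn_output)]
    return js_divergence(logit_lens(resid_before, unembed),
                         logit_lens(resid_after, unembed))
```

A score near zero means the FFN left the predicted distribution alone; a large score at a layer flagged as hallucination-prone is the signal described above.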
ICLR2025 HAIC
Last edited: August 8, 2025
ICLR2025 Koyejo
Proposal: Focus AI measurements on the validity of specific terms.
Five pillars of claim making:
- content validity: does your evaluation cover all valuable cases?
- criterion validity: does your evaluation correlate with a known validated standard?
- construct validity: does your evaluation measure the intended construct?
- external validity: does your evaluation generalize across different environments or settings?
- consequential validity: does your evaluation consider the real-world impact of test interpretation and use?
Open problem: validity of measurement for claims of HAIC.
ICLR2025 Index
Last edited: August 8, 2025
Sessions
- ICLR2025 Keynote
- ICLR2025 Adaptive Computation
- ICLR2025 Tokenizer-Free Approaches
- ICLR2025 Context and Retrieval
- ICLR2025 MoE
- ICLR2025 HAIC
Posters
Surprises/Takes
- best safety is to actually unlearn the danger
- LLMs: averages representations; Robotics: current point planning
- substitute data for understanding
ICLR2025 Jin: MoE++ zero-computation experts
Last edited: August 8, 2025
Motivation
A standard MoE activates a fixed number of experts per token, regardless of how hard the token is.
Key Insight
MoE++ lets the number of FFN experts applied to each token be adaptive.
Method
Three key contributions:
- zero-computation experts: discard the input, \(E(x) = 0\); copy the input, \(E(x) = x\) (“skip”); constant, \(E(x) = \alpha_{a} x +\alpha_{b} v_{\theta}\) (in addition to the normal FFN experts)
- pathway-aware router (with an additional loss augmentation where we learn a \(\tau_{\theta}\) to decide)
- something else I missed
zero-computation experts
- simple, so easy tokens are handled quickly
- adding new experts is relatively low cost
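The three zero-computation experts above can be sketched as plain functions. This is an illustration under assumptions: \(\alpha_{a}\), \(\alpha_{b}\) are passed in as fixed numbers, `v` stands for the learned vector \(v_{\theta}\), and the gate weights come from a hypothetical router that is not modeled here. A tiny ReLU FFN expert is included only for contrast.

```python
def zero_expert(x):
    """Discard the input: E(x) = 0."""
    return [0.0 for _ in x]

def copy_expert(x):
    """Skip: E(x) = x."""
    return list(x)

def constant_expert(x, v, alpha_a, alpha_b):
    """E(x) = alpha_a * x + alpha_b * v, with v a learned vector."""
    return [alpha_a * xi + alpha_b * vi for xi, vi in zip(x, v)]

def ffn_expert(x, w_in, w_out):
    """A normal FFN expert (two matrix multiplies with a ReLU), for comparison."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def moe_layer(x, experts, gate_weights):
    """Weighted sum of expert outputs; experts with zero gate weight are not run."""
    out = [0.0] * len(x)
    for expert, g in zip(experts, gate_weights):
        if g == 0.0:
            continue
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out
```

A token routed only to the zero, copy, or constant experts costs no matrix multiplies, which is why these experts are cheap to add.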
ICLR2025 Kallini: MrT5 Tokenizer-Free
Last edited: August 8, 2025
Motivation
ByT5 is very expensive (because you have to have a residual on every damn token)
MrT5
MrT5 uses a soft attention masking gate at pretraining time to delete unused tokens; at inference time we use a hard cut.
Cool: MrT5 learns language-specific compression rates (different languages have different rates).
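A small sketch of the soft versus hard deletion described above, assuming the gate produces one non-positive value per token that is added to the attention logits during training, and that inference drops tokens whose gate falls below a threshold. The gate values (0 and -30) and the threshold (-15) in the usage note are illustrative assumptions, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_delete_attention(scores, gate):
    """Training time: add each key's gate value to its attention logit.

    A strongly negative gate pushes that key's attention weight toward zero
    while keeping everything differentiable.
    """
    return softmax([s + g for s, g in zip(scores, gate)])

def hard_delete(tokens, gate, threshold):
    """Inference time: actually remove tokens whose gate is below the threshold."""
    return [t for t, g in zip(tokens, gate) if g >= threshold]
```

For example, with `gate = [0, -30, 0, -30, 0]` over the bytes of `"hello"`, the soft version gives the second and fourth bytes near-zero attention weight, and `hard_delete(list("hello"), gate, -15)` shortens the sequence to `['h', 'l', 'o']`, which is where the compute savings come from.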
