ICLR2025 Index
Sessions
- ICLR2025 Keynote
- ICLR2025 Adaptive Computation
- ICLR2025 Tokenizer-Free Approaches
- ICLR2025 Context and Retrieval
- ICLR2025 MoE
- ICLR2025 HAIC
Posters
Surprises/Takes
- the best safety is to actually unlearn the dangerous capability
- LLMs average representations; robotics plans from the current point
- substitute data for understanding
ICLR2025 Jin: MoE++ zero-computation experts
Motivation
In standard MoE, a fixed number of experts is activated for every token, regardless of how hard that token is.
Key Insight
MoE++ makes the number of real (FFN) experts spent on each token adaptive.
Method
Three key contributions:
- zero-computation experts: discard the input \(E(x) = 0\), copy the input \(E(x) = x\) (“skip”), and a constant expert \(E(x) = \alpha_{a} x + \alpha_{b} v_{\theta}\), alongside the normal FFN experts
- pathway-aware router (with an additional loss term, where a learned \(\tau_{\theta}\) is used to decide the routing)
- something else I missed
zero-computation experts
- makes it simple to handle easy tokens quickly
- adding these new experts is relatively low cost (see the sketch below)
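To make the expert types concrete, here is a minimal PyTorch sketch (my own reconstruction, not the MoE++ code); how the mixing weights \(\alpha_a, \alpha_b\) are produced (a small linear head plus softmax here) is an assumption, since the note only records the final formula.

```python
import torch
import torch.nn as nn


class ZeroExpert(nn.Module):
    """Discard the input: E(x) = 0. Costs essentially nothing."""
    def forward(self, x):
        return torch.zeros_like(x)


class CopyExpert(nn.Module):
    """Skip connection: E(x) = x."""
    def forward(self, x):
        return x


class ConstantExpert(nn.Module):
    """E(x) = alpha_a * x + alpha_b * v_theta, with v_theta a learned vector.
    The per-token softmax head producing (alpha_a, alpha_b) is an assumption."""
    def __init__(self, d_model):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))
        self.mix = nn.Linear(d_model, 2)

    def forward(self, x):
        alpha = torch.softmax(self.mix(x), dim=-1)            # (..., 2)
        return alpha[..., :1] * x + alpha[..., 1:] * self.v


class FFNExpert(nn.Module):
    """A normal FFN expert for comparison; the only costly expert type."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)
```

Because the first three expert types have (almost) no parameters and no matmuls, routing an easy token to them is effectively free, which is where the adaptive per-token compute comes from.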
ICLR2025 Kallini: MrT5 Tokenizer-Free
Motivation
ByT5 is very expensive (because you have to keep a residual-stream state for every damn byte).
MrT5
MrT5 uses a soft attention masking gate at pretraining time to delete unused tokens; at inference time we use a hard cut.
Cool: MrT5 learns per-language compression rates on its own (different languages end up with different deletion rates).
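A minimal sketch of the soft-mask-then-hard-delete idea, under my assumptions: the gate is a small linear head on the hidden states, its (non-positive) log-space value is added to the attention logits during training, and tokens below an arbitrary threshold are physically dropped at inference. All names and the threshold are mine, not MrT5's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeletionGate(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden, training: bool, threshold: float = -5.0):
        # Gate in log space: 0 keeps a byte, large negative values delete it.
        g = -F.softplus(self.proj(hidden)).squeeze(-1)   # (batch, seq), g <= 0
        if training:
            # Soft mask: add g to the attention logits so "deleted" bytes fade
            # out while gradients still flow through the gate.
            attn_bias = g[:, None, None, :]              # broadcast over heads/queries
            keep_index = None
        else:
            # Hard cut: physically drop bytes whose gate falls below the threshold.
            attn_bias = None
            keep_index = g >= threshold                  # boolean keep-mask per byte
        return attn_bias, keep_index
```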
ICLR2025 Li: MoE is secretly an embedding
Motivation
Can we directly extract embeddings from the MoE routing weights produced during a forward pass (as opposed to the traditional residual-stream information)?
Key Insight
Using residual states vs. routing weights as semantic-search embeddings offers complementary strengths (i.e., when one method fails, the other tends to succeed).
Method
Create an aggregate embedding:
\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}
where \(X_{j}\) is the residual-stream state and \(W_{j}\) is the corresponding routing-weight vector.
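A small sketch of the aggregate embedding \(E_{j} = X_{j} + \alpha W_{j}\), assuming the routing-weight vector has already been flattened/projected to the same dimensionality as the residual state; the L2 normalization of each part is my own addition, not from the talk.

```python
import torch
import torch.nn.functional as F


def aggregate_embedding(x_resid: torch.Tensor, routing_w: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """E_j = X_j + alpha * W_j.

    x_resid:   residual-stream hidden state for the prompt, shape (d,)
    routing_w: routing weights (router softmax outputs collected over layers),
               assumed already projected/padded to shape (d,)
    """
    # Normalize each part so the two signals live on a comparable scale
    # before mixing (assumption).
    return F.normalize(x_resid, dim=-1) + alpha * F.normalize(routing_w, dim=-1)
```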
ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation
Motivation
Standard models spend a fixed amount of computation on every input; the compute doesn’t adapt to difficulty.
Fixed-Point Iteration for Adaptation
method: CNN
- for every layer, perform fixed-point iteration until convergence to mask out (what exactly?)
- also supervise an “introspection model” that learns to skip the entire fixed-point iteration
- loss: LM + supervision for the introspection model
method: MIND-transformer
- for every layer, perform fixed-point iteration until the attention activations converge
- same introspection model as above (a minimal sketch of the loop and the skip follows below)
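A rough sketch of the per-layer fixed-point loop plus the introspection skip, assuming the layer maps a (batch, seq, d) tensor to the same shape and the introspection model is a small head (e.g. nn.Linear(d, 1)); the convergence test, the 0.5 skip threshold, and the iteration cap are all my own choices, not the paper's.

```python
import torch
import torch.nn as nn


def fixed_point_layer(layer: nn.Module, x: torch.Tensor,
                      introspect: nn.Module, tol: float = 1e-3,
                      max_iters: int = 10) -> torch.Tensor:
    """Iterate the layer to a fixed point, unless introspection says to skip."""
    # If the introspection head predicts the input is "easy", skip the
    # iteration entirely and apply the layer once.
    if torch.sigmoid(introspect(x.mean(dim=1))).mean() > 0.5:
        return layer(x)
    # Otherwise iterate the layer until its output stops changing.
    z = x
    for _ in range(max_iters):
        z_next = layer(z)
        if (z_next - z).norm() < tol * z.norm():
            break
        z = z_next
    return z_next
```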