ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation
Motivation
Standard computation doesn’t adapt: every input gets the same amount of compute, regardless of difficulty.
Fixed-Point Iteration for Adaptation
method: CNN
- for every layer, run fixed-point iteration until convergence, using the result to mask out (what exactly?)
- also supervise an “introspection model” that lets us skip the fixed-point iteration entirely
- loss: LM loss + supervision for the introspection model
method: MIND-transformer
- for every layer, run fixed-point iteration until the attention activations converge
- ditto introspection as above (sketch of both ideas below)
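A minimal sketch of how I understood the mechanism (the function names, the convergence test, and the MSE supervision are my assumptions, not details from the talk): iterate a layer to a fixed point, and train an introspection head to predict that fixed point directly so the loop can be skipped at inference.

```python
import torch

def layer_fixed_point(layer, x, tol=1e-4, max_iter=50):
    """Iterate z <- layer(z, x) until the update is small (a fixed point)."""
    z = torch.zeros_like(x)
    for _ in range(max_iter):
        z_next = layer(z, x)
        if (z_next - z).norm() < tol * z.norm().clamp(min=1e-8):
            return z_next
        z = z_next
    return z

def forward_with_introspection(layer, introspect, x, use_skip=True):
    """Either predict the fixed point directly (cheap) or iterate to it (exact)."""
    if use_skip:
        return introspect(x)            # one-shot prediction of the fixed point
    return layer_fixed_point(layer, x)  # full iteration

def introspection_loss(layer, introspect, x):
    """Supervision term: pull the introspection head toward the iterated fixed point.
    Total training loss = LM loss + this term (per the notes above)."""
    with torch.no_grad():
        z_star = layer_fixed_point(layer, x)
    return torch.nn.functional.mse_loss(introspect(x), z_star)
```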
ICLR2025 MoE
Talks
ICLR2025 Neitemeier: Hierarchical Autoregressive Transformers
“A Byte Level transformer, with some compression”
Key insight: put a [CLS] token in front of every word to train a small “tokenizer” encoder, run a normal transformer over the [CLS] tokens, then autoregressively decode the individual bytes.
Method
Hierarchical Autoregressive Transformers
We put a [CLS] in front of every word, so the input looks like
[CLS] M y _ [CLS] n a m e _ [CLS] i s
We then run a small encoder over each word’s bytes, take the encoded [CLS] states, run the normal autoregressive transformer over that [CLS] sequence, and finally decode the next word’s bytes one at a time with a small decoder.
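A toy sketch of the input construction and the three-stage flow (the encoder/backbone/decoder split is from the talk; the reserved id and helper below are illustrative):

```python
# Toy sketch: byte-level input with a [CLS] id before each word.
CLS = 256  # reserved id outside the byte range 0..255 (my choice, not the paper's)

def to_chunks(text: str):
    """Split into words and prepend a [CLS] id to each word's bytes."""
    return [[CLS] + list(word.encode("utf-8")) for word in text.split(" ")]

chunks = to_chunks("My name is")
# -> [[256, 77, 121], [256, 110, 97, 109, 101], [256, 105, 115]]

# Stage 1: a small encoder runs over each chunk; keep only the encoded [CLS].
# Stage 2: the backbone transformer runs autoregressively over the [CLS] states.
# Stage 3: a small decoder emits the next word byte by byte, conditioned on the
#          backbone's output state.
```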
ICLR2025 Saturday Posters
ICLR2025 Cassidy: AssistanceZero
- Train a reward predictor so that rewards are also available at test time
- Plan with MCTS
- Learn to match the MCTS root-node distribution via KL (sketch below)
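My reading of “match root node KL” is the usual AlphaZero-style distillation, i.e. train the policy toward the MCTS root visit distribution; a hedged sketch (the exact AssistanceZero loss may differ):

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(policy_logits, root_visit_counts):
    """KL(visit distribution || policy): push the policy toward MCTS at the root."""
    target = root_visit_counts / root_visit_counts.sum(dim=-1, keepdim=True)
    log_policy = F.log_softmax(policy_logits, dim=-1)
    return F.kl_div(log_policy, target, reduction="batchmean")
```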
ICLR2025 Liu: Synthesizing Programmatic Reinforcement Learning Policies with LLM-Guided Search
Hill climbing over LLM-generated programs, using partial mutations (sketch below).
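Presumably a loop like this (all function names are placeholders, not from the paper):

```python
def hill_climb(seed_program, mutate, evaluate, steps=100):
    """Greedy hill climbing: ask the LLM for a partial mutation, keep it if better."""
    best, best_score = seed_program, evaluate(seed_program)
    for _ in range(steps):
        candidate = mutate(best)     # LLM rewrites part of the program
        score = evaluate(candidate)  # run the policy, measure return
        if score > best_score:
            best, best_score = candidate, score
    return best
```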
ICLR2025 Weller: Promptriever
??
ICLR2025 Yu: Robust LLM Safeguarding via Refusal Feature Adversarial Training
With mechanistic interpretability, we can find a subspace correlated with refusal and amplify (“pull up”) that feature (sketch below).
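A hedged sketch of the standard way such a refusal direction is extracted (difference of mean activations over harmful vs. harmless prompts); whether the paper does exactly this, I didn’t note:

```python
import torch

def refusal_direction(acts_harmful, acts_harmless):
    """acts_*: (n_prompts, d_model) residual-stream activations at one layer."""
    direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return direction / direction.norm()

def amplify(hidden, direction, alpha=1.0):
    """'Pull up' the refusal feature by adding alpha along the direction."""
    return hidden + alpha * direction
```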
ICLR2025 Snell: Optimality of Scaling LLM Test-Time Compute
Compute-Optimal Scaling
Compute-Optimal Scaling is the notion of selecting the optimal test-time configuration (beam width, search budget, etc.) dynamically, per question-difficulty bin (formula below).
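From memory (so possibly paraphrased), the compute-optimal strategy picks, for each prompt q and compute budget N, the strategy hyperparameters theta that maximize the probability of producing the correct answer y*(q):

```latex
\theta^{*}_{q}(N) = \arg\max_{\theta}\;
  \mathbb{E}_{y \sim \mathrm{Target}(\theta,\, N,\, q)}
  \left[ \mathbf{1}\{\, y = y^{*}(q) \,\} \right]
```

where Target(theta, N, q) is the output distribution induced by running strategy theta with budget N on q. Since y*(q) is unknown at test time, theta is selected per difficulty bin rather than per question.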
Approaches to “Scaling Test-Time Compute”
Three primary approaches (the first two are sketched after the list):
- best-of-n: roll out many complete solutions, reject all but the verifier’s best
- beam search: check intermediate steps against a verifier, keep the top beams
- lookahead search: MCTS-ish (do lookahead rollouts to score steps)
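A compact sketch of the first two strategies; `sample`, `extend`, and the scoring functions stand in for a sampler and a verifier/PRM, nothing here is from the paper itself:

```python
def best_of_n(sample, score, n=16):
    """Sample n complete solutions, return the highest-scoring one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def beam_search(extend, score_step, width=4, depth=8, seed=""):
    """Keep the top-`width` partial solutions by per-step verifier score."""
    beams = [seed]
    for _ in range(depth):
        expanded = [cand for b in beams for cand in extend(b)]
        beams = sorted(expanded, key=score_step, reverse=True)[:width]
    return beams[0]
```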
Key insight
- On easy questions, beam search shows over-optimization and best-of-n is good
- On medium/hard questions, beam search is better
- Lookahead search seems bad?
