ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation
Motivation
Standard computation doesn’t adapt: every input gets the same amount of compute, regardless of difficulty.
Fixed-Point Iteration for Adaptation
method: CNN
- for every layer, run fixed-point iteration until convergence, using the result to mask out (what exactly?)
- also supervise an “introspection model” that lets us skip the fixed-point iteration entirely
- loss: LM loss + supervision for the introspection model
method: MIND-transformer
- for every layer, run fixed-point iteration until the attention activations converge
- ditto introspection as above (sketch of both ideas below)
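A minimal sketch of how I understood the mechanism (the function names, the convergence test, and the MSE supervision are my assumptions, not details from the talk): iterate a layer to a fixed point, and train an introspection head to predict that fixed point directly so the loop can be skipped at inference.

```python
import torch

def layer_fixed_point(layer, x, tol=1e-4, max_iter=50):
    """Iterate z <- layer(z, x) until the update is small (a fixed point)."""
    z = torch.zeros_like(x)
    for _ in range(max_iter):
        z_next = layer(z, x)
        if (z_next - z).norm() < tol * z.norm().clamp(min=1e-8):
            return z_next
        z = z_next
    return z

def forward_with_introspection(layer, introspect, x, use_skip=True):
    """Either predict the fixed point directly (cheap) or iterate to it (exact)."""
    if use_skip:
        return introspect(x)            # one-shot prediction of the fixed point
    return layer_fixed_point(layer, x)  # full iteration

def introspection_loss(layer, introspect, x):
    """Supervision term: pull the introspection head toward the iterated fixed point.
    Total training loss = LM loss + this term (per the notes above)."""
    with torch.no_grad():
        z_star = layer_fixed_point(layer, x)
    return torch.nn.functional.mse_loss(introspect(x), z_star)
```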
ICLR2025 MoE
Talks
ICLR2025 Neitemeier: Hierarchical Autoregressive Transformers
“A Byte Level transformer, with some compression”
Key insight: put a [CLS] token in front of every word to train a small “tokenizer” encoder, run a normal transformer over the [CLS] tokens, then autoregressively decode the individual bytes.
Method
Hierarchical Autoregressive Transformers
We put a [CLS] in front of every word, so the input looks like
[CLS] M y _ [CLS] n a m e _ [CLS] i s
We then run a small encoder over each word’s bytes, take the encoded [CLS] states, run the normal autoregressive transformer over that [CLS] sequence, and finally decode the next word’s bytes one at a time with a small decoder.
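A toy sketch of the input construction and the three-stage flow (the encoder/backbone/decoder split is from the talk; the reserved id and helper below are illustrative):

```python
# Toy sketch: byte-level input with a [CLS] id before each word.
CLS = 256  # reserved id outside the byte range 0..255 (my choice, not the paper's)

def to_chunks(text: str):
    """Split into words and prepend a [CLS] id to each word's bytes."""
    return [[CLS] + list(word.encode("utf-8")) for word in text.split(" ")]

chunks = to_chunks("My name is")
# -> [[256, 77, 121], [256, 110, 97, 109, 101], [256, 105, 115]]

# Stage 1: a small encoder runs over each chunk; keep only the encoded [CLS].
# Stage 2: the backbone transformer runs autoregressively over the [CLS] states.
# Stage 3: a small decoder emits the next word byte by byte, conditioned on the
#          backbone's output state.
```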
ICLR2025 Saturday Posters
ICLR2025 Cassidy: AssistanceZero
- Train a reward predictor so that rewards are also available at test time
- Plan with MCTS
- Learn to match the MCTS root-node distribution via KL (sketch below)
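My reading of “match root node KL” is the usual AlphaZero-style distillation, i.e. train the policy toward the MCTS root visit distribution; a hedged sketch (the exact AssistanceZero loss may differ):

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(policy_logits, root_visit_counts):
    """KL(visit distribution || policy): push the policy toward MCTS at the root."""
    target = root_visit_counts / root_visit_counts.sum(dim=-1, keepdim=True)
    log_policy = F.log_softmax(policy_logits, dim=-1)
    return F.kl_div(log_policy, target, reduction="batchmean")
```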
ICLR2025 Liu: Synthesizing Programmatic Reinforcement Learning Policies with LLM-Guided Search
Hill climbing over LLM-generated programs, using partial mutations (sketch below).
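Presumably a loop like this (all function names are placeholders, not from the paper):

```python
def hill_climb(seed_program, mutate, evaluate, steps=100):
    """Greedy hill climbing: ask the LLM for a partial mutation, keep it if better."""
    best, best_score = seed_program, evaluate(seed_program)
    for _ in range(steps):
        candidate = mutate(best)     # LLM rewrites part of the program
        score = evaluate(candidate)  # run the policy, measure return
        if score > best_score:
            best, best_score = candidate, score
    return best
```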
ICLR2025 Weller: Promptriever
??
ICLR2025 Yu: Robust LLM Safeguarding via Refusal Feature Adversarial Training
With mechanistic interpretability, we can find a subspace correlated with refusal and amplify (“pull up”) that feature (sketch below).
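A hedged sketch of the standard way such a refusal direction is extracted (difference of mean activations over harmful vs. harmless prompts); whether the paper does exactly this, I didn’t note:

```python
import torch

def refusal_direction(acts_harmful, acts_harmless):
    """acts_*: (n_prompts, d_model) residual-stream activations at one layer."""
    direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return direction / direction.norm()

def amplify(hidden, direction, alpha=1.0):
    """'Pull up' the refusal feature by adding alpha along the direction."""
    return hidden + alpha * direction
```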
ICLR2025 Snell: Optimality of Scaling LLM Test-Time Compute
Compute-Optimal Scaling
Compute-Optimal Scaling is the notion of selecting the optimal test-time configuration (beam width, search budget, etc.) dynamically, per question-difficulty bin (formula below).
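From memory (so possibly paraphrased), the compute-optimal strategy picks, for each prompt q and compute budget N, the strategy hyperparameters theta that maximize the probability of producing the correct answer y*(q):

```latex
\theta^{*}_{q}(N) = \arg\max_{\theta}\;
  \mathbb{E}_{y \sim \mathrm{Target}(\theta,\, N,\, q)}
  \left[ \mathbf{1}\{\, y = y^{*}(q) \,\} \right]
```

where Target(theta, N, q) is the output distribution induced by running strategy theta with budget N on q. Since y*(q) is unknown at test time, theta is selected per difficulty bin rather than per question.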
Approaches to “Scaling Test-Time Compute”
Three primary approaches (the first two are sketched after the list):
- best-of-n: roll out many complete solutions, reject all but the verifier’s best
- beam search: check intermediate steps against a verifier, keep the top beams
- lookahead search: MCTS-ish (do lookahead rollouts to score steps)
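A compact sketch of the first two strategies; `sample`, `extend`, and the scoring functions stand in for a sampler and a verifier/PRM, nothing here is from the paper itself:

```python
def best_of_n(sample, score, n=16):
    """Sample n complete solutions, return the highest-scoring one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def beam_search(extend, score_step, width=4, depth=8, seed=""):
    """Keep the top-`width` partial solutions by per-step verifier score."""
    beams = [seed]
    for _ in range(depth):
        expanded = [cand for b in beams for cand in extend(b)]
        beams = sorted(expanded, key=score_step, reverse=True)[:width]
    return beams[0]
```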
Key insight
- On easy questions, beam search shows over-optimization and best-of-n is good
- On medium/hard questions, beam search is better
- Lookahead search seems bad?
