
ICLR2025 Jin: MoE++ Zero-Computation Experts

Last edited: August 8, 2025

Motivation

In a standard MoE, a fixed number of experts is activated per token, regardless of how easy the token is.

Key Insight

MoE++ makes the per-token expert allocation adaptive, so easy tokens can receive little or no computation.

Method

Three key contributions:

  1. zero-computation experts: a zero expert that discards the input (\(E(x) = 0\)), a copy ("skip") expert (\(E(x) = x\)), and a constant expert (\(E(x) = \alpha_{a} x + \alpha_{b} v_{\theta}\)), alongside the normal FFN experts
  2. pathway-aware router, with an additional loss term where a learned \(\tau_{\theta}\) enters the routing decision
  3. something else I missed

zero-computation experts

  1. easy tokens can be handled quickly and cheaply
  2. the new experts add relatively little cost (see the sketch below)
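Here is a minimal PyTorch sketch of the three zero-computation expert types above, next to a normal FFN expert for contrast. The class names, and treating \(\alpha_{a}, \alpha_{b}\) as plain trainable scalars, are my assumptions rather than the paper's exact parameterization.

```python
# Sketch of MoE++'s zero-computation experts, based on the formulas in the notes.
# Class names and the scalar parameterization of alpha_a / alpha_b are my assumptions.
import torch
import torch.nn as nn


class ZeroExpert(nn.Module):
    """Discards the input: E(x) = 0."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros_like(x)


class CopyExpert(nn.Module):
    """Skips computation entirely: E(x) = x."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x


class ConstantExpert(nn.Module):
    """Mixes the input with a trainable vector: E(x) = alpha_a * x + alpha_b * v."""
    def __init__(self, d_model: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))   # v_theta
        self.alpha_a = nn.Parameter(torch.ones(()))   # assumed: trainable scalars
        self.alpha_b = nn.Parameter(torch.ones(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha_a * x + self.alpha_b * self.v


class FFNExpert(nn.Module):
    """A normal FFN expert, for contrast: the only one that costs real FLOPs."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```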

ICLR2025 Kilani: MrT5 Tokenizer-Free

Last edited: August 8, 2025

Motivation

ByT5 is very expensive (because you have to carry a residual stream for every damn byte-level token).

MrT5

MrT5 uses a soft attention-masking gate during pretraining to learn which tokens to delete; at inference time the gate applies a hard cut and actually removes them.
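A minimal sketch of the soft-vs-hard deletion idea, assuming a per-token sigmoid gate that becomes an additive attention bias during training and a hard drop at inference; the gate architecture, masking constant, and threshold are my assumptions, not MrT5's exact design.

```python
# Sketch of the soft/hard deletion gate idea, assuming a per-token sigmoid gate.
# The gate architecture, the -1e9 masking constant, and the 0.5 threshold are my
# assumptions; MrT5's exact gating and placement in the encoder may differ.
import torch
import torch.nn as nn


class DeletionGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor, training: bool, threshold: float = 0.5):
        """h: (batch, seq, d_model) byte-level hidden states."""
        keep_prob = torch.sigmoid(self.score(h)).squeeze(-1)   # (batch, seq)
        if training:
            # Soft deletion: keep every token, but add a large negative bias to
            # the attention logits of tokens the gate wants to drop.
            attn_bias = (keep_prob - 1.0) * 1e9   # ~0 if kept, ~-1e9 if dropped
            return h, attn_bias
        # Hard cut at inference: actually remove the low-scoring tokens.
        kept = [h[b][keep_prob[b] >= threshold] for b in range(h.size(0))]
        return kept, None
```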

Cool: MrT5 learns language-dependent compression rates (different languages end up with different deletion rates).

ICLR2025 Li: MoE is secretly an embedding

Last edited: August 8, 2025

Motivation

Can we directly extract embeddings from the MoE routing weights produced during a forward pass (as opposed to the traditional residual-stream hidden states)?

Key Insight

Using residual hidden states vs. routing weights as semantic-search embeddings offers complementary strengths (when one method fails, the other one tends to succeed).

Method

Create an aggregate embedding:

\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}

where \(X_{j}\) is the residual-stream hidden state and \(W_{j}\) is the corresponding routing-weight vector.
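A minimal sketch of the aggregation above, assuming \(X_{j}\) and \(W_{j}\) have already been brought to a common dimension (e.g. by concatenating router weights across layers and padding); that preprocessing and the per-component normalization are my choices, not necessarily the paper's.

```python
# Sketch of the aggregate embedding E_j = X_j + alpha * W_j from the note.
# Normalizing each component before mixing is my assumption (the two live on
# very different scales), not necessarily what the paper does.
import torch
import torch.nn.functional as F


def aggregate_embedding(hidden_state: torch.Tensor,
                        routing_weights: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """hidden_state: (d,) residual-stream state X_j.
    routing_weights: (d,) router weights W_j, already padded/projected to dim d."""
    x = F.normalize(hidden_state, dim=-1)     # X_j
    w = F.normalize(routing_weights, dim=-1)  # W_j
    return x + alpha * w
```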

ICLR2025 Mathur: MIND Adaptive Thinking with Dynamic Computation

Last edited: August 8, 2025

Motivation

A standard forward pass spends the same amount of computation on every input, regardless of difficulty.

Fixed-Point Iteration for Adaptation

Method: CNN

  1. for every layer, run fixed-point iteration until convergence, then mask out (what exactly?)
  2. also supervise an “introspection model” that predicts when the fixed-point iteration can be skipped entirely
  3. loss: LM loss + supervision loss for the introspection model

Method: MIND-transformer

  1. for every layer, run fixed-point iteration until the attention activations converge
  2. same introspection model and supervision as above (see the sketch below)
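Both variants share the same control flow as I understand it: iterate a layer to a fixed point unless an introspection head predicts the iteration can be skipped. A minimal sketch follows, with the convergence test, thresholds, and introspection head shape all being my assumptions.

```python
# Sketch of the shared MIND control flow from the notes: iterate a layer to a
# fixed point, unless an introspection head predicts we can skip the iteration.
# Convergence test, thresholds, and the introspection head's form are assumptions.
import torch
import torch.nn as nn


def fixed_point_layer(layer: nn.Module, x: torch.Tensor,
                      introspect: nn.Module,
                      tol: float = 1e-3, max_iters: int = 16) -> torch.Tensor:
    """x: (batch, seq, d). introspect: e.g. nn.Linear(d, 1)."""
    # Introspection model predicts whether the fixed-point loop is unnecessary.
    skip_prob = torch.sigmoid(introspect(x.mean(dim=1))).mean()
    if skip_prob > 0.5:
        return layer(x)            # single pass, no iteration

    h = x
    for _ in range(max_iters):
        h_next = layer(h)
        # Stop when the update is small relative to the activation norm.
        delta = (h_next - h).norm() / (h.norm() + 1e-8)
        h = h_next
        if delta < tol:
            break
    return h
```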