Houjun Liu

ICLR2025 Jin: MoE++ zero-computation experts

Motivation

In a standard MoE, a fixed number of experts is activated per token.

Key Insight

MoE++ lets the amount of expert computation spent on each token be adaptive.

Method

Three key contributions:

zero-computation experts: discard input \(E\qty(x) = 0\), copy input \(E\qty(x) = x\) (“skip”), const \(E\qty(x) = \alpha_{a} x +\alpha_{b} v_{\theta}\) (in addition to the normal FFN experts)
pathway-aware router, with an additional loss augmentation where we learn a \(\tau_{\theta}\) to decide ...
something else I missed
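
The three zero-computation experts listed above are simple enough to write out directly. Below is a minimal NumPy sketch, not the paper's implementation. The function names are hypothetical, and `alpha_a`, `alpha_b`, and the vector `v` are passed in as plain values because these notes do not record how they are produced.

```python
import numpy as np

def zero_expert(x):
    # Discard the input: E(x) = 0
    return np.zeros_like(x)

def copy_expert(x):
    # Skip: pass the input through unchanged, E(x) = x
    return x.copy()

def const_expert(x, v, alpha_a, alpha_b):
    # Const: E(x) = alpha_a * x + alpha_b * v
    # v stands in for the vector v_theta; alpha_a and alpha_b are
    # given as scalars here (an assumption made for illustration).
    return alpha_a * x + alpha_b * v

x = np.array([1.0, -2.0, 3.0])   # a toy token representation
v = np.array([0.5, 0.5, 0.5])    # hypothetical values for v_theta

print(zero_expert(x))
print(copy_expert(x))
print(const_expert(x, v, alpha_a=0.25, alpha_b=0.75))
```

None of the three involves a matrix multiply over the hidden dimension, which is what makes them cheap next to an FFN expert.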

zero-computation experts

a simple way to handle easy tokens quickly
the new experts are relatively low cost
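
To see why this makes per-token compute adaptive, here is a toy sketch that assumes top-k routing over a mixed pool of FFN and zero-computation experts. The pool layout, the router logits, and the helper `ffn_experts_used` are all made up for illustration. The point is only that a token whose top-k slots go to zero-computation experts runs fewer FFNs.

```python
import numpy as np

def ffn_experts_used(router_logits, is_ffn, k=2):
    """Count how many of each token's top-k experts are FFN experts."""
    # Indices of the k highest-scoring experts per token: shape (tokens, k)
    topk = np.argsort(-router_logits, axis=-1)[:, :k]
    # Look up expert type for each selected index and count the FFN ones
    return is_ffn[topk].sum(axis=-1)

# Hypothetical pool: two FFN experts, then zero, copy, and const experts
is_ffn = np.array([True, True, False, False, False])

# Hypothetical router logits for three tokens
logits = np.array([
    [3.0, 2.0, 0.0, 0.0, 0.0],   # picks both FFN experts
    [3.0, 0.0, 2.0, 0.0, 0.0],   # picks one FFN expert and the zero expert
    [0.0, 0.0, 1.0, 3.0, 2.0],   # picks copy and const, no FFN at all
])

counts = ffn_experts_used(logits, is_ffn)
print(counts)  # [2 1 0]
```

With k fixed at 2, the three tokens still run 2, 1, and 0 FFN experts respectively, so the easy token costs almost nothing.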