Motivation
Standard MoE activates a fixed number of experts for every token.
Key Insight
MoE++ lets the number of experts activated adapt to the input.
Method
Three key contributions:
- zero-computation experts: zero, which discards the input, \(E(x) = 0\); copy, which passes the input through unchanged, \(E(x) = x\) (“skip”); and constant, \(E(x) = \alpha_{a} x + \alpha_{b} v_{\theta}\). These sit alongside the normal FFN experts.
- pathway-aware router, with an additional loss term in which we learn a \(\tau_{\theta}\) to decide
- something else I missed
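
The three zero-computation experts above can be sketched in a few lines of plain Python. This is only an illustration on lists of floats, not the paper's implementation. The notes do not say how \(\alpha_{a}\) and \(\alpha_{b}\) are produced, so the sketch assumes they come from a softmax over a small learned projection of the input; the projection `w_c`, the helper `softmax`, and all function names are hypothetical.

```python
import math


def zero_expert(x):
    """Discard the input: E(x) = 0."""
    return [0.0] * len(x)


def copy_expert(x):
    """Skip: E(x) = x."""
    return list(x)


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def constant_expert(x, v_theta, w_c):
    """Constant expert: E(x) = alpha_a * x + alpha_b * v_theta.

    ASSUMPTION (not stated in the notes): the two mixing weights are
    softmax(w_c @ x), where w_c is a hypothetical 2 x d learned matrix.
    v_theta is the learned constant vector of length d.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_c]
    alpha_a, alpha_b = softmax(logits)
    return [alpha_a * xi + alpha_b * vi for xi, vi in zip(x, v_theta)]


x = [1.0, -2.0, 0.5]
v_theta = [0.1, 0.2, 0.3]
w_c = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # zero projection gives alpha_a = alpha_b = 0.5

out_zero = zero_expert(x)
out_copy = copy_expert(x)
out_const = constant_expert(x, v_theta, w_c)
```

None of the three touches an FFN weight matrix: the zero and copy experts have no parameters at all, and the constant expert only needs `v_theta` plus whatever produces the two mixing weights.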
zero-computation experts
- simple: easy tokens can be handled quickly
- adding new experts is relatively low cost
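
To make the "easy tokens are handled quickly" point concrete, here is a small sketch that counts how many real FFN experts each token ends up calling once zero-computation experts share the pool. It assumes a plain top-k router with hand-picked scores and k = 2; it is not the pathway-aware router, and the expert layout, scores, and function names are all made up for illustration.

```python
FFN, ZERO, COPY, CONST = "ffn", "zero", "copy", "const"


def top_k(scores, k):
    """Indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def ffn_calls_per_token(router_scores, expert_kinds, k=2):
    """For each token, count how many of its top-k experts are real FFNs."""
    counts = []
    for scores in router_scores:
        chosen = top_k(scores, k)
        counts.append(sum(1 for i in chosen if expert_kinds[i] == FFN))
    return counts


# Hypothetical pool: two FFN experts plus one of each zero-computation expert.
expert_kinds = [FFN, FFN, ZERO, COPY, CONST]

# Hand-picked router scores, one row per token (illustrative only).
router_scores = [
    [2.0, 1.5, 0.1, 0.0, -1.0],   # both slots go to FFN experts
    [0.2, 1.0, 0.1, 1.4, -0.3],   # one FFN expert, one copy expert
    [-1.0, -0.5, 1.2, 0.9, 0.0],  # zero and copy experts only
]

counts = ffn_calls_per_token(router_scores, expert_kinds)
print(counts)  # [2, 1, 0]
```

Every token still selects k experts, but the number of FFN forward passes varies from 0 to k, which is where the compute saving for easy tokens would come from.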