One-Liner
Cluster the input sequences, then activate a separate expert group for each cluster.
Motivation
- the heterogeneity of instruction-tuning data poses difficulty for MoE models
- routing operates only at the token level, so it cannot capture sequence-level specialization
Novelty
An architecture that enables hierarchical (cluster-then-token) expert routing.
Notable Methods
Mixture of Clustered Experts
Dual-stage routing mechanism.
- partition the \(M\) experts into \(G\) subgroups of \(N\) experts each (\(M = G \times N\))
- assign each input sequence to a cluster by k-means on its sequence embedding
- given the assigned cluster, route tokens only within that cluster's expert subgroup
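The dual-stage routing above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pre-fit k-means centroids, a mean-pooled sequence embedding, and top-1 token routing, all of which are simplifying choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

G, N = 4, 2               # G expert subgroups of N experts each
M, d = G * N, 16          # M total experts, embedding dim d (illustrative sizes)

centroids = rng.normal(size=(G, d))   # assumed pre-fit k-means centroids
router_w = rng.normal(size=(d, M))    # token-level router weights

def route(tokens):
    """tokens: (T, d) array. Returns (expert id per token, assigned cluster)."""
    # Stage 1: assign the whole sequence to a cluster via its mean embedding.
    seq_emb = tokens.mean(axis=0)
    cluster = int(np.argmin(((centroids - seq_emb) ** 2).sum(axis=1)))
    # Stage 2: token-level routing, masked to that cluster's expert subgroup.
    logits = tokens @ router_w                    # (T, M)
    mask = np.full(M, -np.inf)
    mask[cluster * N:(cluster + 1) * N] = 0.0     # only the subgroup is eligible
    experts = np.argmax(logits + mask, axis=1)    # top-1 within the subgroup
    return experts, cluster

experts, cluster = route(rng.normal(size=(5, d)))
```

Every selected expert id lands inside the assigned cluster's contiguous block `[cluster*N, (cluster+1)*N)`, which is the point of the two-stage design: token routing still picks experts, but only from the subgroup the sequence-level cluster selects.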
Results
- outperforms MoE baselines
- demonstrates expert-group specialization
