EMNLP2025 Eo: Expert Generalization in MoE in IFT

One-Liner

Cluster the input sequences and activate a separate expert group for each cluster.

Motivation

  • the heterogeneity of instruction-tuning data poses difficulty for MoE
  • routing operates only at the token level, so it cannot capture sequence-level generalization

Novelty

Architecture to enable hierarchical expert routing.

Notable Methods

Mixture of Clustered Experts

A dual-stage routing mechanism (a minimal sketch follows the list):

  1. group the \(M\) experts into \(K\) groups of \(N\) experts each (i.e. \(M = K \cdot N\))
  2. cluster the input sequence embeddings with k-means, one cluster per expert group
  3. given a sequence's assigned cluster, route its tokens only to experts within the corresponding group
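
Below is a minimal PyTorch sketch of this dual-stage routing, under assumptions not stated in the paper: sequence embeddings are mean-pooled hidden states, the k-means centroids are assumed to be pre-fit (random placeholders here), and tokens use standard top-k routing restricted to the assigned group. The class name `MoCELayer` and parameters `n_groups`, `experts_per_group`, `top_k` are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoCELayer(nn.Module):
    """Sketch of dual-stage (clustered) expert routing, not the paper's exact implementation."""

    def __init__(self, d_model, n_groups=4, experts_per_group=4, top_k=2):
        super().__init__()
        self.n_groups = n_groups
        self.experts_per_group = experts_per_group
        self.top_k = top_k
        # M = n_groups * experts_per_group experts, each a small FFN
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_groups * experts_per_group)
        )
        self.router = nn.Linear(d_model, n_groups * experts_per_group)
        # k-means centroids over sequence embeddings, one per group
        # (random placeholder; assumed to be fit offline on the IFT data)
        self.register_buffer("centroids", torch.randn(n_groups, d_model))

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Stage 1: assign each sequence to a cluster / expert group
        seq_emb = x.mean(dim=1)                # (batch, d_model) mean-pooled embedding
        group = torch.cdist(seq_emb, self.centroids).argmin(dim=-1)  # (batch,)

        # Stage 2: token-level top-k routing restricted to the assigned group
        logits = self.router(x)                # (batch, seq_len, M)
        mask = torch.full_like(logits, float("-inf"))
        for b, g in enumerate(group.tolist()):
            lo = g * self.experts_per_group
            mask[b, :, lo:lo + self.experts_per_group] = 0.0
        logits = logits + mask                 # experts outside the group get -inf
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                sel = idx[..., k] == e         # tokens routed to expert e at slot k
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * self.experts[e](x[sel])
        return out


# Example: y = MoCELayer(d_model=64)(torch.randn(2, 16, 64))  -> shape (2, 16, 64)
```

The intent of masking router logits outside the assigned group is to keep token routing local to one cluster, which is what should push each expert group to specialize on its cluster of inputs.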

Results

  • outperforms MoE baselines
  • demonstrates expert-group specialization