One-Liner
Cluster the input sequences, then activate a separate expert group for each cluster.
Motivation
- the heterogeneity of instruction-tuning data poses difficulty for MoE models
- routing operates only at the token level, so it cannot capture sequence-level specialization
Novelty
An architecture that enables hierarchical (cluster-then-token) expert routing.
Notable Methods
Mixture of Clustered Experts
Dual-stage routing mechanism.
- partition the \(M\) experts into \(G\) subgroups of \(N\) experts each (\(M = G \times N\))
- assign each input sequence to a cluster by k-means on its sequence embedding
- given the assigned cluster, route tokens only within that cluster's expert subgroup
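The dual-stage routing above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pre-fit k-means centroids, a mean-pooled sequence embedding, and top-1 token routing, all of which are simplifying choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

G, N = 4, 2               # G expert subgroups of N experts each
M, d = G * N, 16          # M total experts, embedding dim d (illustrative sizes)

centroids = rng.normal(size=(G, d))   # assumed pre-fit k-means centroids
router_w = rng.normal(size=(d, M))    # token-level router weights

def route(tokens):
    """tokens: (T, d) array. Returns (expert id per token, assigned cluster)."""
    # Stage 1: assign the whole sequence to a cluster via its mean embedding.
    seq_emb = tokens.mean(axis=0)
    cluster = int(np.argmin(((centroids - seq_emb) ** 2).sum(axis=1)))
    # Stage 2: token-level routing, masked to that cluster's expert subgroup.
    logits = tokens @ router_w                    # (T, M)
    mask = np.full(M, -np.inf)
    mask[cluster * N:(cluster + 1) * N] = 0.0     # only the subgroup is eligible
    experts = np.argmax(logits + mask, axis=1)    # top-1 within the subgroup
    return experts, cluster

experts, cluster = route(rng.normal(size=(5, d)))
```

Every selected expert id lands inside the assigned cluster's contiguous block `[cluster*N, (cluster+1)*N)`, which is the point of the two-stage design: token routing still picks experts, but only from the subgroup the sequence-level cluster selects.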
Results
- outperforms MoE baselines
- demonstrates expert-group specialization
