MoE Review, Zhang: Mixture of Attention Heads

Splits the query (\(Q\)) projection and the attention output projection into per-expert copies, while the key and value projections stay shared across experts. A single router picks the experts per token, so each selected expert's \(Q\) and output projections are always routed together.
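
A minimal PyTorch sketch of this routing pattern. All names here (`MoAttention`, `n_experts`, `top_k`) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoAttention(nn.Module):
    """Mixture of attention-head experts: each expert owns a Q projection and
    an output projection; K and V projections are shared; one router scores
    experts per token, and the same routing weights gate both projections."""

    def __init__(self, d_model: int, d_head: int, n_experts: int, top_k: int):
        super().__init__()
        self.d_head, self.top_k = d_head, top_k
        # Per-expert Q and output projections (E experts).
        self.w_q = nn.Parameter(torch.randn(n_experts, d_model, d_head) * d_model**-0.5)
        self.w_o = nn.Parameter(torch.randn(n_experts, d_head, d_model) * d_head**-0.5)
        # Shared K and V projections, computed once for all experts.
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, T, d_model)
        # One routing decision per token, reused for Q and output projections.
        gate = F.softmax(self.router(x), dim=-1)              # (B, T, E)
        weight, idx = gate.topk(self.top_k, dim=-1)           # (B, T, k)
        weight = weight / weight.sum(-1, keepdim=True)        # renormalize top-k

        k = self.w_k(x)                                       # (B, T, d_head)
        v = self.w_v(x)                                       # (B, T, d_head)

        # Gather selected experts' Q projections: (B, T, k, d_model, d_head).
        w_q = self.w_q[idx]
        q = torch.einsum("btd,btkdh->btkh", x, w_q)           # (B, T, k, d_head)

        # Attention per selected expert against the shared K/V.
        att = torch.einsum("btkh,bsh->btks", q, k) * self.d_head**-0.5
        att = att.softmax(dim=-1)
        head = torch.einsum("btks,bsh->btkh", att, v)         # (B, T, k, d_head)

        # Selected experts' output projections, mixed by routing weights.
        w_o = self.w_o[idx]                                   # (B, T, k, d_head, d_model)
        out = torch.einsum("btkh,btkhd->btkd", head, w_o)
        return (weight.unsqueeze(-1) * out).sum(dim=2)        # (B, T, d_model)
```

In this sketch, K and V are projected once and reused by every selected expert, which is what keeps per-token head selection cheap relative to running all heads.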

Reported to outperform standard multi-head attention (MHA).