Motivation
Can we extract embeddings directly from the routing weights of an MoE forward pass, rather than from the residual stream as is traditional?
Key Insight
Residual states and routing weights offer complementary strengths as semantic search embeddings: when one method fails, the other is more likely to succeed.
Method
Create an aggregate embedding:
\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}
where \(X_{j}\) is the residual, \(W_{j}\) is the routing weight of that residual, and \(\alpha\) scales the contribution of the routing-weight term.
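The aggregate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the function name `aggregate_embedding`, the example vectors, and the value of `alpha` are all hypothetical, and the sketch assumes \(X_{j}\) and \(W_{j}\) have already been brought to the same shape so that they can be added elementwise.

```python
import numpy as np

def aggregate_embedding(x, w, alpha=1.0):
    """Return E = X + alpha * W for one item (hypothetical helper).

    Assumes x (residual) and w (routing weights) already share a shape;
    how they are brought to a common dimension is not specified here.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if x.shape != w.shape:
        raise ValueError(f"shape mismatch: {x.shape} vs {w.shape}")
    return x + alpha * w

# Toy values chosen only for illustration.
x_j = np.array([1.0, 0.0, 2.0])   # residual X_j
w_j = np.array([0.5, 0.5, 0.0])   # routing weights W_j
e_j = aggregate_embedding(x_j, w_j, alpha=2.0)
print(e_j)
```

With \(\alpha = 0\) the aggregate reduces to the residual alone, so \(\alpha\) can be swept to see how much weight the routing signal should receive.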