Houjun Liu

ICLR2025 Li: MoE is secretly an embedding

Motivation

Can we extract embeddings directly from an MoE model's routing weights (i.e., as opposed to the traditional residual stream information)?

Key Insight

Residual states and routing weights offer complementary strengths as semantic search embeddings (i.e., when one method fails, the other one tends to succeed)

Method

Create an aggregate embedding:

\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}

where \(X_{j}\) is the residual, \(W_{j}\) is the routing weight computed for that residual, and \(\alpha\) is a scalar controlling how much the routing weight contributes.
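A minimal sketch of the aggregate, in plain Python. Everything here is an assumption for illustration: the helper names (`softmax`, `cosine`, `aggregate_embedding`, `combined_similarity`) are hypothetical, the toy vectors are made up, and the routing weight is assumed to be the softmax over one router's logits. The sum as written only type-checks if \(X_{j}\) and \(W_{j}\) have the same length, so the sketch also includes a swapped-in variant that applies the same \(\alpha\) weighting to the two cosine similarities instead of to the vectors; this variant is not taken from the notes above.

```python
import math


def softmax(logits):
    """Turn router logits into routing weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aggregate_embedding(x, w, alpha):
    """Literal E_j = X_j + alpha * W_j (assumes x and w share a length)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length for a direct sum")
    return [xi + alpha * wi for xi, wi in zip(x, w)]


def combined_similarity(x_a, w_a, x_b, w_b, alpha):
    """Assumed variant: weight the two similarity scores, not the vectors."""
    return cosine(x_a, x_b) + alpha * cosine(w_a, w_b)


if __name__ == "__main__":
    # Toy residuals (length 3) and router logits over 4 hypothetical experts.
    x_query, x_doc = [0.2, -1.0, 0.5], [0.1, -0.8, 0.7]
    w_query = softmax([2.0, 0.1, -1.0, 0.3])
    w_doc = softmax([1.5, 0.0, -0.5, 0.2])

    # Lengths differ (3 vs 4), so combine at the similarity level.
    print(combined_similarity(x_query, w_query, x_doc, w_doc, alpha=0.5))

    # The direct sum works when the lengths match.
    print(aggregate_embedding([1.0, 2.0], [0.5, 0.5], alpha=2.0))
```

With \(\alpha = 0\) both functions reduce to using the residual alone, so \(\alpha\) can be read as how much weight the routing signal gets.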