MOEReview Krajewski: Scaling Laws for MoE

Define “granularity” as:

\begin{equation} G = \frac{d_{\text{ff}}}{d_{\text{expert}}} \end{equation}

at \(G=1\), we have a dense model; at \(G>1\), we have some kind of MoE.

Here are thy scaling laws:

notice how its mostly linear! tiny experts yay!