“the scaling law (Section 3) shows that more experts (larger E) result in higher performance; on the other hand, more experts result in a larger inference cost (Section 4.2)”
How do we trade off the cost of more experts (measured in GPU-seconds, or in dollars with \(C_0\) the price per GPU-second) against performance?
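A minimal sketch of that trade-off, under toy assumptions: every constant, model shape, and token count below is hypothetical, standing in for the paper's own cost model (Section 4.2). Training is priced by activated-parameter FLOPs; serving is priced by how many GPUs the *total* parameters keep resident for the deployment window.

```python
import math

# Toy dollar-cost model for the expert-count trade-off.
# All constants and functional forms are illustrative assumptions,
# not numbers from the paper.

C0 = 2.0 / 3600        # assumed price per GPU-second ($2 per GPU-hour)
FLOPS = 1e14           # assumed sustained FLOP/s per GPU
GPU_MEM_BYTES = 80e9   # assumed 80 GB of memory per GPU
BYTES_PER_PARAM = 2    # bf16 weights

def train_dollars(active_params: float, train_tokens: float) -> float:
    """Training cost: ~6 * N_active * D FLOPs, priced at C0 per GPU-second."""
    return C0 * 6 * active_params * train_tokens / FLOPS

def serve_dollars(total_params: float, deploy_seconds: float) -> float:
    """Serving cost: every expert must stay resident in GPU memory for the
    whole deployment window, so cost scales with *total* parameters."""
    gpus = math.ceil(total_params * BYTES_PER_PARAM / GPU_MEM_BYTES)
    return C0 * gpus * deploy_seconds

# Two hypothetical configs assumed to reach the same loss:
#   A: 4 experts  -- small total size (cheap to serve), needs more tokens
#   B: 32 experts -- large total size (pricey to serve), needs fewer tokens
for months in (3, 12, 24):
    deploy = months * 30 * 86400
    a = train_dollars(3e9, 8e11) + serve_dollars(8e9, deploy)
    b = train_dollars(3e9, 4e11) + serve_dollars(100e9, deploy)
    print(f"{months:>2} months deployed: A=${a:,.0f}  B=${b:,.0f}")
```

The break-even point shifts with deployment length: a short serving window favors the cheap-to-train many-expert config, while a long one amortizes the extra training cost of the fewer-expert config.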


So, slight over-training achieves better performance. Two findings:
- a smaller number of bigger experts (4/8) is the most serving-efficient, but costs more to train to the same loss
- with enough data, a many-expert (16/32) MoE can be made smaller, and slight over-training can further boost performance (see the sketch below)

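To make the over-training point concrete, here is a minimal sketch assuming a Chinchilla-style parametric loss \(L(N, D) = L_0 + A/N^{\alpha} + B/D^{\beta}\). The coefficients are Hoffmann et al.'s (2022) dense-model fits used as placeholders; the paper fits its own MoE scaling law in Section 3, so treat the numbers as purely illustrative.

```python
# Over-training sketch with a Chinchilla-style parametric loss.
# Coefficients are Hoffmann et al. (2022) dense-model fits, used here
# only as placeholders for the paper's MoE scaling law (Section 3).
L0, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, tokens: float) -> float:
    return L0 + A / n_params**alpha + B / tokens**beta

# A bigger model on fewer tokens vs. a smaller model over-trained on ~3x
# the tokens: the smaller model matches (here, slightly beats) the loss
# while being ~2x cheaper per token to serve.
print(f"16B / 320B tokens: loss = {loss(16e9, 3.2e11):.3f}")
print(f" 8B /   1T tokens: loss = {loss(8e9, 1.0e12):.3f}")
```

Since per-token inference FLOPs scale with the activated parameter count, reaching the same loss with half the parameters roughly halves serving cost, which is why slight over-training pays off once serving volume is large.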