MOEReview Gale: MegaBlocks

Standard MoEs give each expert a fixed token capacity. They then either waste computation by padding each expert’s unused capacity, or drop the tokens routed to an expert once it exceeds that capacity (i.e. truncate so that we don’t have to pad too much).
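
The trade-off between padding and dropping can be made concrete with a small sketch. This is only an illustration in plain Python, not code from the paper: the function name `capacity_dispatch`, the routing list `assignments`, and the capacity values are all hypothetical.

```python
def capacity_dispatch(assignments, num_experts, capacity):
    """Route tokens into fixed-size expert buffers.

    Returns (buffers, dropped, padded_slots). All names are illustrative.
    """
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_id)
        else:
            dropped.append(token_id)  # expert is full, so the token is truncated
    # Every unused slot still has to be padded and computed on.
    padded_slots = sum(capacity - len(b) for b in buffers)
    return buffers, dropped, padded_slots


# Hypothetical, imbalanced routing of 8 tokens to 3 experts.
assignments = [0, 0, 0, 0, 0, 1, 1, 2]
for capacity in (2, 3, 5):
    _, dropped, padded = capacity_dispatch(assignments, 3, capacity)
    print(f"capacity={capacity}: dropped={len(dropped)} padded_slots={padded}")
# capacity=2: dropped=3 padded_slots=1
# capacity=3: dropped=2 padded_slots=3
# capacity=5: dropped=0 padded_slots=7
```

With this routing, no single capacity avoids both problems: raising it until nothing is dropped leaves 7 of the 15 slots as padding.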

Method

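Instead of a batched matrix multiplication in which every expert processes the same fixed number of tokens, we compute a block-diagonal product whose blocks can differ in size, and leverage efficient block-sparse matrix multiplication so that each expert can handle a variable number of tokens.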
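
A minimal sketch of the variable-size idea follows. It is an assumption-laden illustration, not the MegaBlocks implementation: it loops over experts in plain Python, whereas the real system relies on block-sparse GPU kernels. The helpers `matmul` and `dropless_moe`, and the toy weights, are hypothetical. Each per-expert product below corresponds to one block of a block-diagonal matrix product, and the blocks are allowed to have different numbers of rows.

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dropless_moe(tokens, assignments, expert_weights):
    """Apply each token's assigned expert with no capacity limit.

    Illustrative only: no token is dropped and no padding is added.
    """
    outputs = [None] * len(tokens)
    for expert, weight in enumerate(expert_weights):
        # Variable-sized group: however many tokens the router sent here.
        idx = [i for i, e in enumerate(assignments) if e == expert]
        if not idx:
            continue  # an expert with no tokens does no work
        group_out = matmul([tokens[i] for i in idx], weight)
        for i, row in zip(idx, group_out):
            outputs[i] = row  # scatter results back to the original order
    return outputs


tokens = [[1, 2], [3, 4], [5, 6], [7, 8]]
assignments = [0, 0, 0, 1]              # expert 0 gets 3 tokens, expert 2 gets none
expert_weights = [
    [[2, 0], [0, 2]],                   # expert 0: scale by 2
    [[0, 1], [1, 0]],                   # expert 1: swap the two features
    [[1, 1], [1, 1]],                   # expert 2: unused for this routing
]
print(dropless_moe(tokens, assignments, expert_weights))
# [[2, 4], [6, 8], [10, 12], [8, 7]]
```

The groups here have sizes 3, 1, and 0, so a fixed-capacity batched product would need either padding or truncation to process the same routing.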