MoE Review: Fedus et al., Switch Transformers
At scale, with regularization (including dropout), routing each token to a single expert (k=1) is fine!
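To make k=1 concrete, here is a minimal sketch of top-1 routing for a single token in plain Python. It is not the paper's implementation: the names `route_top1` and `switch_layer`, the toy experts, and the logits are hypothetical, and capacity limits, the load-balancing loss, and dropout are left out. It assumes the chosen expert's output is scaled by its router probability.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits):
    # k=1 routing: pick the single highest-probability expert.
    # Returns (expert_index, gate_value). Hypothetical helper name.
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(token, router_logits, experts):
    # Run only the selected expert, then scale its output by the gate
    # value (assumption: gate = router probability of that expert).
    expert, gate = route_top1(router_logits)
    return [gate * y for y in experts[expert](token)]

# Toy experts standing in for feed-forward networks (illustrative only).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

token = [1.0, 2.0]
router_logits = [0.1, 2.0, -1.0]  # made-up router scores for this token

expert, gate = route_top1(router_logits)
output = switch_layer(token, router_logits, experts)
print(expert, gate, output)
```

Only one expert runs per token, so compute per token stays constant as the number of experts grows. With k>1, the outputs of several experts would be combined using their gate values instead.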