MOEReview Fedus: Switch Transformers

At scale, and with regularization (including dropout), k=1 expert routing (each token sent to a single expert) works fine!
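
To make "k=1" concrete, here is a minimal NumPy sketch of top-1 routing: a softmax router scores the experts, each token goes only to its argmax expert, and that expert's output is scaled by the router probability. The names (`switch_route`, `w_router`) and the linear stand-in experts are assumptions for illustration, not the paper's code. The capacity factor, the load-balancing loss, and dropout are all left out.

```python
import numpy as np

def switch_route(x, w_router, experts):
    """Top-1 (k=1) routing: each token is processed by exactly one expert."""
    logits = x @ w_router                                  # (tokens, n_experts)
    logits = logits - logits.max(axis=-1, keepdims=True)   # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                         # chosen expert per token
    gate = probs[np.arange(x.shape[0]), choice]            # router prob of that expert
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Scale the expert output by the gate value, as in top-1 gating.
            out[mask] = gate[mask, None] * expert(x[mask])
    return out, choice

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 16
x = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
# Hypothetical experts: plain linear maps standing in for feed-forward blocks.
weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
experts = [lambda h, w=w: h @ w for w in weights]

out, choice = switch_route(x, w_router, experts)
```

With k=1, per-token expert compute is that of a single feed-forward block regardless of how many experts exist; only the router cost grows with the expert count.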