MOEReview Li: Branch-Train-Merge

And then when inference we can use domain-conditioned averaging between the experts by computing:

or by averaging the parameters of the experts.