MOEReview Li: Branch-Train-Merge
  1. weighted parameter average of the existing experts (or copy the new perts)
  2. training each expert independently

And then when inference we can use domain-conditioned averaging between the experts by computing:

or by averaging the parameters of the experts.