Project Thoughts

Overall Question: “Why is growing a better idea than training larger models from scratch?”

Cost of Specialization

Sub-Question: “how much of a performance cost does the load-balancing loss incur relative to just leaning on specialized data?”

  • For our goals the data is much more specific (i.e. personalized), so we may not need to rely on ModuleFormer’s load-balancing loss tricks.
  • Switch Transformers shows that, with standard regularization (including dropout), routing each token to a single expert can be sufficient for many inputs (perhaps 1+1, as in shared-expert setups); the load-balancing term in question is sketched below.
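
A minimal sketch of the Switch-Transformers-style balancing term this sub-question is about (my own PyTorch rendering, not the reference implementation; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-style auxiliary balancing term: N * sum_i f_i * P_i.

    router_logits: [num_tokens, num_experts] raw routing scores.
    f_i = fraction of tokens hard-routed to expert i (top-1 counts),
    P_i = mean router probability mass assigned to expert i.
    Minimized (== 1) when routing is perfectly uniform -- exactly the pressure
    we may not want on highly specialized / personalized data.
    """
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)
```

In the Switch Transformers setup this term is added to the task loss with a small coefficient (α on the order of 10⁻²); the sub-question is essentially how much is lost or gained by turning that coefficient down and letting routing collapse onto a few specialized experts.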

How Much Expert is an Expert?

Sub-Question: “do all experts have to have the same representational power?”

Sub-Question: “…or parameter size for that matter?”

  • given MegaBlocks, we know different experts can take different numbers of tokens and still be handled efficiently (in fact more efficiently, via block-sparse kernels that avoid dropping and padding)
  • given the Universal Subspace Hypothesis, we know that training a bunch of models the same way yields a shared substructure, so in a 1+1 shared-expert setup it’s not certain we need huge ‘add-on experts’ if the shared expert is good (a toy version is sketched after this list)
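
As a concrete version of the “experts need not be equally sized” idea above, here is a toy shared-expert-plus-add-ons layer where each add-on expert has its own width and sees its own subset of tokens. This is a sketch under my own assumptions (the name HeterogeneousMoE, the top-1 add-on routing, and the widths are all illustrative), not the MegaBlocks implementation, which handles the variable-token-count part with block-sparse kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousMoE(nn.Module):
    """Toy 1+1 layer: one shared expert every token passes through, plus
    narrow add-on experts of varying hidden size chosen by a top-1 router."""

    def __init__(self, d_model: int, shared_hidden: int, addon_hiddens: list[int]):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, shared_hidden), nn.GELU(), nn.Linear(shared_hidden, d_model)
        )
        # Add-on experts may be much smaller than the shared expert.
        self.addons = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
            for h in addon_hiddens
        ])
        self.router = nn.Linear(d_model, len(addon_hiddens))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        shared_out = self.shared(x)
        probs = F.softmax(self.router(x), dim=-1)
        top1 = probs.argmax(dim=-1)
        addon_out = torch.zeros_like(x)
        for i, expert in enumerate(self.addons):
            mask = top1 == i
            if mask.any():  # each expert may see a different number of tokens
                addon_out[mask] = probs[mask, i].unsqueeze(-1) * expert(x[mask])
        return shared_out + addon_out
```

For example, HeterogeneousMoE(512, 2048, [128, 128, 256, 512]) gives one large shared expert and four add-ons of uneven size; whether small add-ons suffice is exactly the Universal Subspace question.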

Building a Better Expert

Sub-Question: “how can we mechanistically engineer experts with the properties we want?”

  • Given LASER, we know we can surgically keep just the subspaces with the most representational power; it may not take many of them, and doing so can even boost performance (a truncated-SVD sketch follows this list)
  • If we took a bunch of such low-rank representations, could we Branch-Train-Merge them together in a way that gains more expressive power?
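
A minimal sketch of the operation these two bullets lean on. low_rank_approx is the truncated-SVD replacement LASER applies to selected weight matrices; merge_low_rank is a purely hypothetical stand-in for the Branch-Train-Merge question (plain parameter averaging, not the BTM procedure itself):

```python
import torch

def low_rank_approx(W: torch.Tensor, rank: int) -> torch.Tensor:
    """Replace W with its best rank-`rank` approximation (truncated SVD).
    LASER applies this selectively to particular matrices/layers; which
    matrices and which ranks is the search referenced in Questions below."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

def merge_low_rank(pieces: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical merge: average several low-rank matrices of the same shape.
    Only an illustration of the Branch-Train-Merge question, not BTM itself."""
    return torch.stack(pieces).mean(dim=0)
```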

Reading

Mostly one-line, key-insight-level notes on the references.

Growing

Architecture

Representation

Implementation

Scaling Laws

Questions

  • Re MOEReview Sharma: LASER: unclear whether this generalizes OOD, i.e. in Table 1, is the layer/rank search done on one training set and evaluated on many test sets, or on many train sets and many test sets?
  • Re MOEReview Gale: MegaBlocks: why can’t MoEs avoid dropping tokens entirely? i.e. why can’t we just trust the router’s assignments and pad accordingly? Or is the idea that most routers are so imbalanced that the padding would be excessive (capacity-factor arithmetic sketched below)?
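
For context on the padding concern in that last question, the Switch/GShard-style capacity-factor arithmetic looks like this (a sketch; the routing numbers are made up for illustration):

```python
def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Tokens each expert can accept; overflow is dropped. If we instead 'trust the
    router', every expert must be padded up to the busiest expert's token count."""
    return int(capacity_factor * num_tokens / num_experts)

# Illustrative (made-up) routing: 1024 tokens, 8 experts, one hot expert getting ~40%.
# With capacity_factor=1.25 each expert accepts 160 tokens, so roughly 250 of the hot
# expert's ~410 tokens are dropped; padding everyone to 410 instead would mean
# 8 * 410 = 3280 token slots for 1024 real tokens, i.e. >3x wasted compute --
# the overhead that MegaBlocks' block-sparse formulation is designed to avoid.
```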