Project Thoughts

Overall Question: “Why is growing a better idea than training larger models from scratch?”

Cost of Specialization

Sub-Question: “how much of a performance cost does the load-balancing loss incur relative to just leaning on specialized data?”

  • For our goals the data is much more specific (i.e. personalized), so we may not need to rely on ModuleFormer’s load-balancing loss tricks.
  • Switch Transformers shows that, with standard regularization (including dropout), routing each token to a single expert can be sufficient for many inputs (perhaps 1+1, as in shared-expert setups); the load-balancing term in question is sketched below.
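
A minimal sketch of the Switch-Transformers-style balancing term this sub-question is about (my own PyTorch rendering, not the reference implementation; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-style auxiliary balancing term: N * sum_i f_i * P_i.

    router_logits: [num_tokens, num_experts] raw routing scores.
    f_i = fraction of tokens hard-routed to expert i (top-1 counts),
    P_i = mean router probability mass assigned to expert i.
    Minimized (== 1) when routing is perfectly uniform -- exactly the pressure
    we may not want on highly specialized / personalized data.
    """
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)
```

In the Switch Transformers setup this term is added to the task loss with a small coefficient (α on the order of 10⁻²); the sub-question is essentially how much is lost or gained by turning that coefficient down and letting routing collapse onto a few specialized experts.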

How Much Expert is an Expert?

Sub-Question: “do all experts have to have the same representational power?”

Sub-Question: “…or parameter size for that matter?”

  • given MegaBlocks, we know different experts can take different numbers of tokens and still be handled efficiently (in fact more efficiently, via block-sparse kernels that avoid dropping and padding)
  • given the Universal Subspace Hypothesis, we know that training a bunch of models the same way yields a shared substructure, so in a 1+1 shared-expert setup it’s not certain we need huge ‘add-on experts’ if the shared expert is good (a toy version is sketched after this list)
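
As a concrete version of the “experts need not be equally sized” idea above, here is a toy shared-expert-plus-add-ons layer where each add-on expert has its own width and sees its own subset of tokens. This is a sketch under my own assumptions (the name HeterogeneousMoE, the top-1 add-on routing, and the widths are all illustrative), not the MegaBlocks implementation, which handles the variable-token-count part with block-sparse kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousMoE(nn.Module):
    """Toy 1+1 layer: one shared expert every token passes through, plus
    narrow add-on experts of varying hidden size chosen by a top-1 router."""

    def __init__(self, d_model: int, shared_hidden: int, addon_hiddens: list[int]):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, shared_hidden), nn.GELU(), nn.Linear(shared_hidden, d_model)
        )
        # Add-on experts may be much smaller than the shared expert.
        self.addons = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
            for h in addon_hiddens
        ])
        self.router = nn.Linear(d_model, len(addon_hiddens))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        shared_out = self.shared(x)
        probs = F.softmax(self.router(x), dim=-1)
        top1 = probs.argmax(dim=-1)
        addon_out = torch.zeros_like(x)
        for i, expert in enumerate(self.addons):
            mask = top1 == i
            if mask.any():  # each expert may see a different number of tokens
                addon_out[mask] = probs[mask, i].unsqueeze(-1) * expert(x[mask])
        return shared_out + addon_out
```

For example, HeterogeneousMoE(512, 2048, [128, 128, 256, 512]) gives one large shared expert and four add-ons of uneven size; whether small add-ons suffice is exactly the Universal Subspace question.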

Building a Better Expert

Sub-Question: “how can we mechanistically engineer experts with the properties we want?”

  • Given LASER, we know we can surgically keep just the subspaces with the most representational power; it may not take many of them, and doing so can even boost performance (a truncated-SVD sketch follows this list)
  • If we took a bunch of such low-rank representations, could we Branch-Train-Merge them together in a way that gains more expressive power?
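
A minimal sketch of the operation these two bullets lean on. low_rank_approx is the truncated-SVD replacement LASER applies to selected weight matrices; merge_low_rank is a purely hypothetical stand-in for the Branch-Train-Merge question (plain parameter averaging, not the BTM procedure itself):

```python
import torch

def low_rank_approx(W: torch.Tensor, rank: int) -> torch.Tensor:
    """Replace W with its best rank-`rank` approximation (truncated SVD).
    LASER applies this selectively to particular matrices/layers; which
    matrices and which ranks is the search referenced in Questions below."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

def merge_low_rank(pieces: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical merge: average several low-rank matrices of the same shape.
    Only an illustration of the Branch-Train-Merge question, not BTM itself."""
    return torch.stack(pieces).mean(dim=0)
```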

Reading

Mostly one-line, key-insight-level notes on the references.

Growing

Architecture

Representation

Implementation

Scaling Laws

Questions

  • Re MOEReview Sharma: LASER: unclear whether this generalizes OOD, i.e. in Table 1, is the layer/rank search done on one training set and evaluated on many test sets, or on many train sets and many test sets?
  • Re MOEReview Gale: MegaBlocks: why can’t MoEs avoid dropping tokens entirely? i.e. why can’t we just trust the router’s assignments and pad accordingly? Or is the idea that most routers are so imbalanced that the padding would be excessive (capacity-factor arithmetic sketched below)?
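
For context on the padding concern in that last question, the Switch/GShard-style capacity-factor arithmetic looks like this (a sketch; the routing numbers are made up for illustration):

```python
def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Tokens each expert can accept; overflow is dropped. If we instead 'trust the
    router', every expert must be padded up to the busiest expert's token count."""
    return int(capacity_factor * num_tokens / num_experts)

# Illustrative (made-up) routing: 1024 tokens, 8 experts, one hot expert getting ~40%.
# With capacity_factor=1.25 each expert accepts 160 tokens, so roughly 250 of the hot
# expert's ~410 tokens are dropped; padding everyone to 410 instead would mean
# 8 * 410 = 3280 token slots for 1024 real tokens, i.e. >3x wasted compute --
# the overhead that MegaBlocks' block-sparse formulation is designed to avoid.
```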