Houjun Liu

Build a System, Not a Monolith

“How do we build well developed AI systems without a bangin’ company”

Two main paradigms

  • transfer learning: (pretrain a model, and) fine-tune it for faster convergence and better performance
  • monolithic models: (pretrain a model, and) just use the pretrained model directly

Problems with monolithic models

  • Continual development of large language models mostly doesn’t exist: no incremental updates
  • To get improvements, we throw out the old monolithic model entirely
  • Most of the research community can’t participate in their development

New Alternative Paradigm

  • A very simple routing layer
  • A very large collection of specialist models, all derived from one base model (see the sketch after this list)
  • Collaborative model development: a large number of contributors can band together to develop the models
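
A minimal sketch of this system, assuming a dot-product router over per-specialist task embeddings (all names, shapes, and the routing rule are illustrative, not from any particular system):

  import numpy as np

  rng = np.random.default_rng(0)
  d = 16
  base_W = rng.normal(size=(d, d))  # shared, frozen base model weights

  # Each specialist = (task embedding used for routing, small weight delta).
  specialists = {
      "sentiment": (rng.normal(size=d), 0.01 * rng.normal(size=(d, d))),
      "ner":       (rng.normal(size=d), 0.01 * rng.normal(size=(d, d))),
  }

  def predict(x):
      # Input-level routing: pick the specialist whose task embedding
      # best matches the input representation.
      name = max(specialists, key=lambda n: x @ specialists[n][0])
      delta = specialists[name][1]
      return (base_W + delta) @ x  # base model plus the specialist's update

  y = predict(rng.normal(size=d))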

Why

  • Specialist models are cheaper and better to train
    • few-shot parameter-efficient fine-tuning is better (Liu et al.)
    • few shot fine-tuning is better than few-shot in-context learning
  • Specialist models can be communicable, incremental updates to a base model (see the sketch after this list)
    • think: PEFT
    • each specialist model needs to update only a small percentage of the weights
    • think “adapters”: parameter-efficient updates
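
A hedged sketch of why such updates are communicable, using a LoRA-style low-rank delta (variable names are mine; this shows the general technique, not any specific library’s API):

  import numpy as np

  rng = np.random.default_rng(0)
  d, r = 1024, 8                      # full width vs. low rank
  W = rng.normal(size=(d, d))         # frozen base weights (shared by everyone)
  A = rng.normal(size=(r, d)) * 0.01  # trainable; this is what gets shipped
  B = np.zeros((d, r))                # trainable; this is what gets shipped

  W_specialist = W + B @ A            # effective weights = base + small update
  update_fraction = (A.size + B.size) / W.size  # ~1.6% of the base parameters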

Routing

  • task2vec: task embedding for meta-learning (Achille et al.)
  • efficiently tuned parameters are task embeddings (Zhou et al.; see the sketch below)
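
A sketch of the second idea, assuming we represent each specialist by its flattened adapter parameters and route by cosine similarity (all names here are illustrative):

  import numpy as np

  def cosine(u, v):
      return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

  rng = np.random.default_rng(0)
  # Library of specialists, each keyed by its flattened adapter parameters.
  library = {f"specialist_{i}": rng.normal(size=128) for i in range(5)}

  def route(query_embedding):
      # Return the specialist whose tuned parameters look most like the
      # (briefly tuned) parameters for the new task.
      return max(library, key=lambda name: cosine(query_embedding, library[name]))

  print(route(rng.normal(size=128)))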

Distinction from MoE

  • instead of routing at the sub-layer level, we route at the input level
  • we look at the input as a whole when choosing a specialist, rather than routing each token at each layer

Novel Tasks (Model Merging)

Tasks can be viewed as a composition of skills.

  1. each task can be encoded as a composition of skills
  2. we can merge the skills of sub-models

Usual updates

  1. we take a pretrained model
  2. we adapt it to some target task

Model Merging

  • Fisher-weighted averaging

    “Merging Models with Fisher-Weighted Averaging” (Matena et al.). Merging can be framed as an optimization problem:

    \begin{equation} \arg\max_{\theta} \sum_{i=1}^{M} \lambda_{i} \log p(\theta \mid \mathcal{D}_{i}) \end{equation}

    “a merged model is the set of parameters that maximizes the sum of each model’s log-posterior given its dataset \(\mathcal{D}_{i}\), weighted by \(\lambda_{i}\)”
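
    Under a diagonal (Laplace-style) approximation to each posterior, this optimization has a simple closed form: a per-parameter average weighted by Fisher information. A minimal sketch of that closed form (variable names are mine):

      import numpy as np

      def fisher_merge(thetas, fishers, lambdas):
          # Precision-weighted average: parameters a model is "sure" about
          # (high Fisher) dominate the merge at that coordinate.
          num = sum(l * F * t for t, F, l in zip(thetas, fishers, lambdas))
          den = sum(l * F for F, l in zip(fishers, lambdas))
          return num / den

      rng = np.random.default_rng(0)
      theta_1, theta_2 = rng.normal(size=100), rng.normal(size=100)
      F_1, F_2 = rng.random(100) + 1e-6, rng.random(100) + 1e-6  # diagonal Fishers
      theta_merged = fisher_merge([theta_1, theta_2], [F_1, F_2], [1.0, 1.0])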

  • Task arithmetic

    “Editing Models with Task Arithmetic” (Ilharco et al.); “Resolving Interference When Merging Models” (Yadav et al.)

    You can create multi-task models by just doing arithmetic on the weights:

    \begin{equation} \tau_{1} = \theta_{finetune_{1}} - \theta_{pretrain} \end{equation}

    \begin{equation} \tau_{2} = \theta_{finetune_{2}} - \theta_{pretrain} \end{equation}

    \begin{equation} \theta_{finetune_{1+2}} = (\tau_{1} + \tau_{2}) + \theta_{pretrain} \end{equation}

    this apparently works ok.
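
    The equations above, directly in code (stand-in arrays in place of real fine-tuned checkpoints):

      import numpy as np

      rng = np.random.default_rng(0)
      theta_pretrain = rng.normal(size=1000)
      theta_ft_1 = theta_pretrain + 0.1 * rng.normal(size=1000)  # stand-in fine-tune
      theta_ft_2 = theta_pretrain + 0.1 * rng.normal(size=1000)  # stand-in fine-tune

      tau_1 = theta_ft_1 - theta_pretrain        # task vector 1
      tau_2 = theta_ft_2 - theta_pretrain        # task vector 2
      theta_multitask = theta_pretrain + (tau_1 + tau_2)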

  • Soft MoE

    “Soft Merging of Experts with Adaptive Routing” (Muqeeth et al.)

    MoE, but instead of choosing a single expert to activate, the router’s probabilities are used to take a weighted average of the experts’ weights. So, multiple experts can be combined in a linear way.
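
    A toy sketch of that merge, assuming one weight matrix per expert and a softmax router (shapes and names are illustrative):

      import numpy as np

      rng = np.random.default_rng(0)
      d, n_experts = 16, 4
      experts = rng.normal(size=(n_experts, d, d))  # one weight matrix per expert
      router_W = rng.normal(size=(n_experts, d))

      def softmax(z):
          e = np.exp(z - z.max())
          return e / e.sum()

      def soft_moe_forward(x):
          p = softmax(router_W @ x)                  # routing probabilities
          merged = np.tensordot(p, experts, axes=1)  # convex combination of weights
          return merged @ x                          # one pass through the merged expert

      y = soft_moe_forward(rng.normal(size=d))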

Git-Theta

“Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models” (Kandpal et al.)

Communal and iterative development of model checkpoints. Saves only the LoRA-updated parameters, and drops any weights that didn’t change between commits.
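
This is not Git-Theta’s actual interface; just a sketch of the idea it builds on, that a checkpoint commit only needs to store the parameter groups that changed since the previous checkpoint:

  import numpy as np

  def diff_checkpoint(prev, new):
      # Keep only the parameter groups that differ from the previous checkpoint.
      return {name: new[name] for name in new
              if name not in prev or not np.allclose(prev[name], new[name])}

  rng = np.random.default_rng(0)
  ckpt_v1 = {"layer1": rng.normal(size=(4, 4)), "layer2": rng.normal(size=(4, 4))}
  ckpt_v2 = {**ckpt_v1, "layer2": ckpt_v1["layer2"] + 0.1}  # only layer2 was updated

  delta = diff_checkpoint(ckpt_v1, ckpt_v2)  # stores just {"layer2": ...}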

Petals

“Petals: Collaborative Inference and Fine-Tuning of Large Models” (Borzunov et al.)

Distributed fine-tuning and model inference, using different worker nodes to run different layers of the model (see the sketch below).
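
Not the Petals API; a toy sketch of the execution model it describes, where consecutive blocks of layers live on different worker nodes and an input flows through them in order:

  import numpy as np

  rng = np.random.default_rng(0)
  d, n_layers = 8, 6
  layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

  # Hypothetical assignment of contiguous layer spans to workers.
  workers = {"worker_a": layers[0:2], "worker_b": layers[2:4], "worker_c": layers[4:6]}

  def distributed_forward(x):
      for name, span in workers.items():  # each hop would be a remote call
          for W in span:
              x = np.tanh(W @ x)
      return x

  y = distributed_forward(rng.normal(size=d))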

https://health.petals.dev/