Posts

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Last edited: August 8, 2025

Broyden's Method

Last edited: August 8, 2025

Broyden’s method is a quasi-Newton method that iteratively builds an approximation to the Jacobian. We give the root-finding variant here (i.e. the search direction is for solving \(F\qty(x) = 0\) rather than \(\min F\)).

For function \(F\), initialize \(J^{(0)} = I\). Then, at every step:

  • compute the direction \(\Delta c^{(q)}\) by solving \(J^{(q)}\Delta c^{(q)} = -F\qty(c^{(q)})\)
  • line search: compute \(\alpha = \arg\min_{\alpha} F\qty(c^{(q)} + \alpha \Delta c^{(q)})^{T}F\qty(c^{(q)} + \alpha \Delta c^{(q)})\) for root finding
  • set \(c^{(q+1)} = c^{(q)} + \alpha \Delta c^{(q)}\)
  • redefine \(\Delta c^{(q)} = c^{(q+1)} - c^{(q)}\), the step actually taken
  • finally, update \(J\) with the rank-one correction…

\begin{equation} J^{(q+1)} = J^{(q)} + \frac{1}{\qty(\Delta c^{(q)})^{T} \Delta c^{(q)}} \qty(F\qty(c^{(q+1)}) - F\qty(c^{(q)}) - J^{(q)}\Delta c^{(q)}) \qty(\Delta c^{(q)})^{T} \end{equation}
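The loop above can be sketched as follows. This is a minimal illustration, not a production solver: the exact line search is replaced by a simple backtracking (halving) search on \(F^{T}F\), and the stopping tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def broyden_root(F, c, iters=50, tol=1e-10):
    """Root-finding Broyden sketch, starting from J = I."""
    J = np.eye(len(c))
    Fc = F(c)
    for _ in range(iters):
        dc = np.linalg.solve(J, -Fc)   # direction from J dc = -F(c)
        alpha = 1.0                    # crude backtracking line search
        while alpha > 1e-4:            # (assumption: halving, not an exact argmin)
            F_new = F(c + alpha * dc)
            if F_new @ F_new < Fc @ Fc:
                break
            alpha *= 0.5
        s = alpha * dc                 # the step actually taken
        # rank-one update of the Jacobian approximation
        J = J + np.outer(F_new - Fc - J @ s, s) / (s @ s)
        c, Fc = c + s, F_new
        if np.linalg.norm(Fc) < tol:
            break
    return c
```

On a linear system \(F(x) = Ax - b\) the approximation \(J^{(q)}\) quickly approaches \(A\) itself, after which the step is exactly Newton's.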

buffer overflow

Last edited: August 8, 2025

A buffer overflow happens when an operation like strcpy runs beyond the edge of its allocated buffer. We need to find and fix buffer overflows, which put the people who use our software at risk.

buffer overflow horror stories

  • AOL Instant Messenger

identifying buffer overflows

Think about whether what you are about to write can cause a buffer overflow. Some functions you shouldn’t use:

  • strcpy: keeps copying until it hits a null terminator, with no bounds check
  • strcat: appends to the destination without checking its remaining size
  • gets: keeps taking input forever and ever, with no way to limit it

https://www.acm.org/code-of-ethics “Design and implement systems that are robustly and usably secure.”

Build a System, Not a Monolith

Last edited: August 8, 2025

“How do we build well developed AI systems without a bangin’ company”

Two main paradigms

  • transfer learning: pretrain a model, then fine-tune it; faster convergence, better performance
  • monolithic models: pretrain a model, then just use the pretrained model as-is

Problems with monolithic models

  • Continual development of large language models mostly doesn’t exist: no incremental updates
  • To get better performance, we throw out the old monolithic model and train a new one
  • Most of the research community can’t participate in their development

New Alternative Paradigm

  • A very simple routing layer
  • A very large collection of specialist models all from a base model
  • Collaborative model development: a large number of contributors can band together to contribute to the development of the models

Why

  • Specialist models are cheaper and better to train
    • few-shot parameter-efficient fine-tuning is better (Liu et al.)
    • few-shot fine-tuning is better than few-shot in-context learning
  • Specialist models can be communicable, incremental updates to a base model
    • think: PEFT
    • each specialist model may only need to update a small percentage of the weights
    • think “adapters”: parameter efficient updates
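The “small percent of the weights” point can be made concrete with a low-rank (LoRA-style) adapter sketch. All dimensions here are made up for illustration; the point is the parameter count of the delta relative to the frozen base weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(size=(d, d))         # frozen base-model weight (illustrative)
r = 4                               # adapter rank, r << d
A = rng.normal(size=(d, r)) * 0.01  # trainable adapter factor
B = np.zeros((r, d))                # zero-init so the update starts as a no-op

W_eff = W + A @ B                   # specialist = base + low-rank delta
# shipping a specialist means shipping only A and B:
params_full = W.size
params_adapter = A.size + B.size
print(params_adapter / params_full)  # → 0.125
```

So a specialist here is a communicable 12.5% update on top of the shared base; at realistic model sizes the ratio is far smaller.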

Routing

  • task2vec: task embeddings for meta-learning (Achille et al.)
  • efficiently tuned parameters are task embeddings (Zhou et al.)
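One simple reading of input-level routing, assuming we already have a task embedding for the query and for each specialist (the names and vectors below are hypothetical), is nearest-neighbor selection by cosine similarity:

```python
import numpy as np

def route(task_embedding, specialist_embeddings):
    """Pick the specialist whose task embedding is most similar (cosine)."""
    def unit(v):
        return v / np.linalg.norm(v)
    q = unit(task_embedding)
    scores = {name: float(unit(e) @ q)
              for name, e in specialist_embeddings.items()}
    return max(scores, key=scores.get)

# hypothetical task embeddings for three specialists
specialists = {
    "code":  np.array([1.0, 0.1, 0.0]),
    "math":  np.array([0.0, 1.0, 0.2]),
    "legal": np.array([0.1, 0.0, 1.0]),
}
print(route(np.array([0.9, 0.2, 0.1]), specialists))  # → code
```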

distinction from MoE

  • instead of routing at the sub-layer level, we route at the input level
  • we look at the

Novel Tasks (Model Merging)

Tasks can be considered as a composition of skills.
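One way to compose skills is “task arithmetic”: add weighted specialist deltas (specialist minus base) onto the base parameters. A toy sketch with made-up one-layer “models”:

```python
import numpy as np

def merge(base, specialists, weights):
    """Compose skills by adding weighted specialist deltas to the base."""
    merged = {k: v.copy() for k, v in base.items()}
    for theta, lam in zip(specialists, weights):
        for k in merged:
            merged[k] += lam * (theta[k] - base[k])
    return merged

# toy parameter dicts (hypothetical)
base = {"w": np.array([1.0, 1.0])}
spec_a = {"w": np.array([2.0, 1.0])}  # skill A shifts the first weight
spec_b = {"w": np.array([1.0, 3.0])}  # skill B shifts the second weight
print(merge(base, [spec_a, spec_b], [1.0, 0.5])["w"])  # → [2. 2.]
```

Each skill contributes its own delta, so a novel task can borrow from several specialists at once.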

Byte-Pair Encoding

Last edited: August 8, 2025

BPE is a common Subword Tokenization scheme.

Training

  1. choose the two symbols that are most frequently adjacent
  2. merge those two symbols as one symbol throughout the text
  3. repeat from step \(1\) until we have merged \(k\) times
def train_bpe(corpus, k):
    v = set(corpus.characters())                 # vocabulary starts as the characters
    for _ in range(k):
        tl, tr = get_most_common_bigram(corpus)  # most frequent adjacent pair
        tnew = f"{tl}{tr}"
        v.add(tnew)                              # sets use add, not push
        corpus.replace((tl, tr), tnew)           # merge the pair throughout the text
    return v

Most commonly, BPE is not run alone: it usually runs inside a space-separated tokenization pipeline. Hence, after each word we usually put a special _ token which delineates the end of a word.
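Putting the pieces together, a self-contained version of the training loop (with the end-of-word _ marker, and helpers written out rather than assumed) might look like:

```python
from collections import Counter

def train_bpe(text, k):
    # split on spaces; append the end-of-word marker to each word
    words = [list(w) + ["_"] for w in text.split()]
    vocab = {s for w in words for s in w}
    for _ in range(k):
        # count adjacent symbol pairs across all words
        pairs = Counter(p for w in words for p in zip(w, w[1:]))
        if not pairs:
            break
        (tl, tr), _ = pairs.most_common(1)[0]
        tnew = tl + tr
        vocab.add(tnew)
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):  # greedy left-to-right merge of the chosen pair
                if i + 1 < len(w) and w[i] == tl and w[i + 1] == tr:
                    out.append(tnew)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        words = merged
    return vocab
```

For example, on "low low low lower lowest" two merges produce the subword "low" alongside the original characters.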