Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Broyden's Method
Broyden's method is an approximate method for estimating the Jacobian. We give the root-finding variant here (i.e. the search direction is for finding \(F\qty(x) = 0\) instead of \(\min F\)).
For a function \(F\), initialize \(J^{(0)} = I\). Then, for every step:
- compute \(\Delta c^{(q)}\) from \(J^{(q)}\Delta c^{(q)} = -F\qty(c^{(q)})\)
- compute \(\alpha = \arg\min_{\alpha} F\qty(c^{(q)} + \alpha \Delta c^{(q)})^{T}F\qty(c^{(q)} + \alpha \Delta c^{(q)})\), i.e. a line search on the squared residual, since we are root finding
- compute \(c^{(q+1)} = c^{(q)} + \alpha \Delta c^{(q)}\)
- set \(\Delta c^{(q)} = c^{(q+1)} - c^{(q)}\), so the step now includes the line-search scaling \(\alpha\)
- finally, we can update \(J\) with the Broyden update:
\begin{equation} J^{(q+1)} = J^{(q)} + \frac{1}{\qty(\Delta c^{(q)})^{T} \Delta c^{(q)}} \qty(F\qty(c^{(q+1)}) - F\qty(c^{(q)}) - J^{(q)}\Delta c^{(q)}) \qty(\Delta c^{(q)})^{T} \end{equation}
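As a concrete sketch of the loop above in NumPy (the name broyden_root, the bounded line-search bracket, and the use of scipy.optimize.minimize_scalar are choices of this sketch, not part of the original notes):

import numpy as np
from scipy.optimize import minimize_scalar

def broyden_root(F, c0, steps=50, tol=1e-10):
    c = np.asarray(c0, dtype=float)
    J = np.eye(len(c))                                   # J^(0) = I
    for _ in range(steps):
        Fc = F(c)
        if np.linalg.norm(Fc) < tol:
            break
        dc = np.linalg.solve(J, -Fc)                     # J dc = -F(c)
        # line search on F^T F (root finding), over an arbitrary bracket [0, 2]
        alpha = minimize_scalar(lambda a: F(c + a * dc) @ F(c + a * dc),
                                bounds=(0.0, 2.0), method="bounded").x
        step = alpha * dc                                # Delta c = c^(q+1) - c^(q)
        if step @ step < 1e-16:                          # avoid dividing by ~0 below
            break
        c_new = c + step
        # Broyden update: J += (F(c_new) - F(c) - J step) step^T / (step^T step)
        J += np.outer(F(c_new) - Fc - J @ step, step) / (step @ step)
        c = c_new
    return c

# e.g. broyden_root(lambda c: np.array([c[0]**2 + c[1]**2 - 1, c[0] - c[1]]), [2.0, 1.0])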
buffer overflow
A buffer overflow happens when operations like strcpy run beyond the edge of the allocated buffer. We need to find and fix buffer overflows, which put the people who use our software at risk.
buffer overflow horror stories
- AOL Messenger
identifying buffer overflows
Think about whether or not what you are going to do will cause a buffer overflow. There are things you shouldn't do:
https://www.acm.org/code-of-ethics “Design and implement systems that are robustly and usably secure.”
Build a System, Not a Monolith
“How do we build well-developed AI systems without a bangin' company?”
Two main paradigms
- transfer learning: pretrain a model, then fine-tune it; faster convergence, better performance
- monolithic models: pretrain a model, and just use the pretrained model
Problems with monolithic models
- Continual development of large language models mostly doesn't exist: there are no incremental updates
- To get better improvements, we throw out the old monolithic model
- Most of the research community can’t participate in their development
New Alternative Paradigm
- A very simple routing layer
- A very large collection of specialist models all from a base model
- Collaborative model development means that a large number of contributors can band together to contribute to the development of the models
Why
- Specialist models are cheaper and better to train
- few-shot parameter-efficient fine-tuning is better (Liu et al.)
- few-shot fine-tuning is better than few-shot in-context learning
- Specialist models can be communicable, incremental updates to a base model
- think: PEFT
- each of the specialist models only needs to update a small percentage of the weights
- think “adapters”: parameter-efficient updates (see the sketch after this list)
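For instance, here is a sketch of what such a parameter-efficient, communicable update could look like, assuming a PyTorch-style bottleneck adapter (the class name and sizes below are illustrative, not from the talk):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small trainable bottleneck added on top of a frozen base layer."""
    def __init__(self, d_model: int, d_bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)   # project down
        self.up = nn.Linear(d_bottleneck, d_model)     # project back up

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # residual update: only these few weights need to be trained and shipped
        return h + self.up(torch.relu(self.down(h)))

Freezing the base model and training only such adapters means each specialist is a small, shippable diff against the shared base.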
Routing
- task2vec: task embeddings for meta-learning (Achille et al.)
- efficiently tuned parameters are task embeddings (Zhou et al.)
distinction from MoE
- instead of routing at the sub-layer level, we route at the input level
- we look at the
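A toy illustration of input-level routing (my own sketch, not the actual method): embed the incoming task, then pick the specialist whose stored task embedding is closest.

import numpy as np

def route(task_emb: np.ndarray, specialists: dict) -> str:
    """Return the name of the specialist with the most similar task embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(specialists, key=lambda name: cos(task_emb, specialists[name]))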
Novel Tasks (Model Merging)
Tasks can be considered as a composition of skills.
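One way to make that composition concrete (hedged: this is generic task-arithmetic-style merging, one of several possible schemes) is to add weighted task vectors, i.e. specialist-minus-base weight differences, onto the base model:

import numpy as np

def merge(base: dict, specialists: list, weights: list) -> dict:
    """base + sum_i w_i * (specialist_i - base), parameter by parameter."""
    return {name: w0 + sum(w * (s[name] - w0) for s, w in zip(specialists, weights))
            for name, w0 in base.items()}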
Byte-Pair Encoding
BPE is a common Subword Tokenization scheme.
Training
- choose the two symbols that are most frequently adjacent
- merge those two symbols into one symbol throughout the text
- repeat from step \(1\) until we have merged \(k\) times
def train_bpe(corpus, k):
    # vocabulary starts as the set of individual characters
    v = set(corpus.characters())
    for _ in range(k):
        # find the most frequently adjacent pair of symbols in the corpus
        tl, tr = get_most_common_bigram(corpus)
        tnew = f"{tl}{tr}"
        v.add(tnew)                     # sets use .add, not .push
        corpus.replace((tl, tr), tnew)  # merge the pair throughout the text
    return v
Most commonly, BPE is not run alone: it usually runs inside a space-separation system. Hence, after each word we usually put a special _ token which delineates the end of a word.
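Here is a small self-contained sketch of that word-internal setup (assuming whitespace tokenization and the _ end-of-word marker; this is illustrative, not any particular library's implementation):

from collections import Counter

def train_bpe_words(text, k):
    """Learn k merges over whitespace-separated words, with _ marking end of word."""
    # every word becomes a sequence of characters plus the end-of-word marker
    words = Counter(tuple(w) + ("_",) for w in text.split())
    merges = []
    for _ in range(k):
        # count adjacent symbol pairs, weighted by how often each word occurs
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # apply the merge inside every word
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# e.g. train_bpe_words("low low lower lowest new newer newer", k=10)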