Brownian Motion
Last edited: August 8, 2025
Brownian Motion is the process that a random walk converges to as its time steps become continuous.
discrete random walk
A discrete random walk is a tool used to construct Brownian Motion. It is a random walk whose step takes only two discrete values at any given time: \(\Delta\) and its additive inverse \(-\Delta\). These two cases occur with probabilities \(\pi\) and \(1-\pi\), respectively.
Therefore, the return at each time step \(k\) is the random variable:
\begin{equation} \epsilon_{k} = \begin{cases} \Delta, & \text{with probability } \pi \\ -\Delta, & \text{with probability } 1-\pi \end{cases} \end{equation}
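The two-valued step above can be simulated directly. The following is a minimal sketch, assuming NumPy is available; the function name `simulate_walk` and its parameters are illustrative and do not come from the notes. Since each step is \(\Delta\) with probability \(\pi\) and \(-\Delta\) otherwise, its mean is \(\Delta(2\pi - 1)\), which the sample mean of many steps should approach.

```python
import numpy as np

def simulate_walk(delta, pi, n_steps, seed=None):
    """Simulate a discrete random walk (hypothetical helper).

    Each step is +delta with probability pi and -delta with
    probability 1 - pi. Returns the steps and the cumulative path.
    """
    rng = np.random.default_rng(seed)
    # rng.random draws uniformly from [0, 1), so "< pi" is true with probability pi
    up = rng.random(n_steps) < pi
    steps = np.where(up, delta, -delta)
    # The walk's position after each step is the running sum of the steps
    path = np.cumsum(steps)
    return steps, path

steps, path = simulate_walk(delta=1.0, pi=0.7, n_steps=100_000, seed=0)
sample_mean = steps.mean()  # should be close to delta * (2 * pi - 1)
```

Shrinking `delta` while increasing `n_steps` is the usual route from this discrete construction toward the continuous-time limit.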
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Last edited: August 8, 2025
Broyden's Method
Last edited: August 8, 2025
Broyden’s method is a quasi-Newton method that maintains an approximation of the Jacobian instead of computing it exactly. We give the root-finding variant here (i.e. the search direction is for finding \(F\qty(x) = 0\) instead of \(\min F\)).
For a function \(F\), let \(J^{(0)} = I\). At every step \(q\):
- compute the search direction \(\Delta c^{(q)}\) from \(J^{(q)}\Delta c^{(q)} = -F\qty(c^{(q)})\)
- compute the step size \(\alpha = \arg\min_{\alpha} F\qty(c^{(q)} + \alpha \Delta c^{(q)})^{T}F\qty(c^{(q)} + \alpha \Delta c^{(q)})\), which minimizes the squared residual (for root finding)
- compute \(c^{(q+1)} = c^{(q)} + \alpha \Delta c^{(q)}\)
- set \(\Delta c^{(q)} = c^{(q+1)} - c^{(q)}\), i.e. rescale the direction by \(\alpha\) so that it is the step actually taken
- finally, update \(J\) such that:
\begin{equation} J^{(q+1)} = J^{(q)} + \frac{1}{\qty(\Delta c^{(q)})^{T} \Delta c^{(q)}} \qty(F\qty(c^{(q+1)}) - F\qty(c^{(q)}) - J^{(q)}\Delta c^{(q)}) \qty(\Delta c^{(q)})^{T} \end{equation}
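The steps above can be sketched in a few lines. This is a minimal sketch, assuming NumPy; the names `broyden_root` and `example_F` are illustrative, and the exact \(\arg\min_{\alpha}\) is replaced by a simple backtracking search that halves \(\alpha\) until the squared residual decreases. That substitution is an assumption made for brevity, not something the notes prescribe.

```python
import numpy as np

def broyden_root(F, c0, max_iter=100, tol=1e-10):
    """Root-finding Broyden iteration (illustrative sketch)."""
    c = np.asarray(c0, dtype=float)
    J = np.eye(c.size)                    # J^(0) = I
    Fc = F(c)
    for _ in range(max_iter):
        if np.linalg.norm(Fc) < tol:
            break
        # Search direction: solve J dc = -F(c)
        step = np.linalg.solve(J, -Fc)
        # Backtracking stand-in for the argmin over alpha (assumption):
        # halve alpha until the squared residual F^T F decreases
        alpha = 1.0
        F_new = F(c + alpha * step)
        while F_new @ F_new >= Fc @ Fc and alpha > 1e-4:
            alpha *= 0.5
            F_new = F(c + alpha * step)
        dc = alpha * step                 # the step actually taken
        # Rank-one update of the Jacobian approximation
        J = J + np.outer(F_new - Fc - J @ dc, dc) / (dc @ dc)
        c, Fc = c + dc, F_new
    return c

def example_F(x):
    # Hypothetical test system whose Jacobian stays close to the identity
    return np.array([x[0] + 0.5 * np.sin(x[1]) - 1.0,
                     x[1] + 0.5 * np.cos(x[0]) - 2.0])

root = broyden_root(example_F, [0.0, 0.0])
residual = np.linalg.norm(example_F(root))
```

Starting from \(J^{(0)} = I\) works well here because the example system is a small perturbation of the identity map; a poorly scaled \(F\) may need a better initial Jacobian estimate.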
buffer overflow
Last edited: August 8, 2025
A buffer overflow happens when an operation like strcpy runs beyond the edge of the allocated buffer. We need to find and fix buffer overflows, which cause people who use o
buffer overflow horror stories
- AOL Messenger
identifying buffer overflows
Think about whether what you are about to do will cause a buffer overflow. There are things you shouldn’t do:
https://www.acm.org/code-of-ethics “Design and implement systems that are robustly and usably secure.”
Build a System, Not a Monolith
Last edited: August 8, 2025
“How do we build well developed AI systems without a bangin’ company”
Two main paradigms
- transfer learning: (pretrain a model, and) faster convergence, better performance
- monolithic models: (pretrain a model, and) just use the pretrained model
Problems with monolithic models
- Continual development of large language models mostly doesn’t exist: there are no incremental updates
- To get improvements, we throw out the old monolithic model
- Most of the research community can’t participate in their development
New Alternative Paradigm
- A very simple routing layer
- A very large collection of specialist models all from a base model
- Collaborative model development means that a large number of contributors can band together to contribute to the development of the models
Why
- Specialist models are cheaper and better to train
- few-shot parameter-efficient fine-tuning is better (Liu et al.)
- few-shot fine-tuning is better than few-shot in-context learning
- Specialist models can be communicable, incremental updates to a base model
- think: PEFT
- each specialist model needs to update only a small percentage of the weights
- think “adapters”: parameter efficient updates
Routing
- Task2Vec: task embedding for meta-learning (Achille et al.)
- efficiently tuned parameters are task embeddings (Zhou et al.)
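One way to picture a very simple routing layer built on task embeddings is nearest-neighbor lookup: embed the input, then pick the specialist whose task embedding is most similar. The following is a minimal sketch under that assumption, using NumPy and cosine similarity. The function name `route`, the specialist names, and the embedding vectors are all hypothetical, and how the input embedding is produced is left unspecified.

```python
import numpy as np

def route(input_embedding, task_embeddings):
    """Pick the specialist whose task embedding is most cosine-similar
    to the input embedding (hypothetical routing rule)."""
    x = np.asarray(input_embedding, dtype=float)
    x = x / np.linalg.norm(x)
    best_name, best_score = None, -np.inf
    for name, emb in task_embeddings.items():
        e = np.asarray(emb, dtype=float)
        score = float(x @ (e / np.linalg.norm(e)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical specialists, each described by a 2-d task embedding
specialists = {"math": [1.0, 0.0], "code": [0.0, 1.0]}
chosen = route([0.9, 0.1], specialists)
```

The routing decision here is made once per input, before any specialist runs.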
distinction from MoE
- instead of routing at the sub-layer level, we are routing at the input level
- we look at the
Novel Tasks (Model Merging)
Tasks can be considered as a composition of skills.
