SU-CS224N APR252024
Last edited: August 8, 2025
Transformers
Motivation
Lower Sequence-Length Time Complexity
Minimize Linear Interaction Distance
In a recurrent model, interaction distance scales as \(O(l)\), with \(l\) the sequence length: gradients must flow through every intermediate step, so linear order is baked into the architecture.
Maximize Parallelization
Forward and backward passes in a recurrent model require \(O(l)\) sequential steps (each state waits for the computation to roll from left to right) — instead, you can compute attention in parallel across positions.
Key Advantage
- Maximum interaction distance is \(O(1)\) — each word is connected to each other word
- The number of unparallelizable operations does not grow with sequence length
Self-Attention
Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
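As a sketch (not from the lecture; all names here are illustrative), single-head self-attention over a length-\(l\) sequence can be written in a few lines of numpy — note the \(l \times l\) score matrix, which is what gives every word an \(O(1)\) path to every other word:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every position attends to every position.

    X: (l, d) sequence of word vectors; Wq/Wk/Wv: (d, d) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # query/key/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (l, l): all pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # (l, d) attended representations
```

Every row of `weights` is a distribution over all \(l\) positions, computed with no sequential dependence between positions.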
SU-CS224N APR302024
Subword
We use SUBWORD modeling to deal with:
- combinatorial morphology (resolving word forms and inflections) — “a single word has a million forms in Finnish” (“transformify”)
- misspelling
- extensions/emphasis (“gooooood vibessssss”)
You mark each actual word ending with some kind of combining marker.
To fix this:
Byte-Pair Encoding
“find pieces of words that are common and treat them as a vocabulary”
- start with a vocab containing only characters and an end-of-word symbol
- look at the corpus and find the most common pair of adjacent symbols
- add the pair to the vocab as a new subword, and replace all instances of the pair with it
- repeat steps 2–3 until the vocab is big enough
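The merge loop above can be sketched in plain Python (a minimal toy version, not a production tokenizer; `bpe_learn` and the `"</w>"` marker are illustrative choices):

```python
from collections import Counter

def bpe_learn(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols ending with an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the fused symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

Each learned merge becomes a new subword in the vocabulary; applying the merges in order tokenizes unseen text.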
Writing Systems
- phonemic (directly translating sounds, see Spanish)
- fossilized phonemic (English, where sounds are whack)
- syllabic/moraic (each syllable or mora written down)
- ideographic (structured like syllabic scripts, but characters carry meaning rather than sound)
- a combination of the above (Japanese)
Whole-Model Pretraining
- all parameters are initialized via pretraining
- don’t even bother training word vectors
MLM and NTP are “Universal Tasks”
Because, across different circumstances, performing well at MLM and NTP requires {world knowledge, scene representations, language, etc.}.
SU-CS224N MAY022024
Zero-Shot Learning
GPT-2 is able to do many tasks with no examples + no gradient updates.
Instruction Fine-Tuning
Language models, by default, are not aligned with user intent.
- collect paired examples of instruction + output across many tasks
- then, evaluate on unseen tasks
~3 million examples << n billion examples
evaluation benchmark: MMLU
You can generate an Instruction Fine-Tuning dataset by asking a larger model for it (see Alpaca).
Pros + Cons
- simple and straightforward + generalizes to unseen tasks
- but, it’s EXPENSIVE to collect ground-truth data
- ground truths may be wrong
- creative tasks may not have a correct answer
- LM training penalizes all token-level mistakes equally, but some mistakes are worse than others
- humans may generate suboptimal answers
Human Preference Modeling
Imagine if we have some input \(x\), and two output trajectories, \(y_{1}\) and \(y_{2}\).
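A standard way to complete this setup (assumed here, not stated in these notes: the Bradley–Terry model used in RLHF-style preference modeling) scores each trajectory with a learned reward model \(r_{\theta}\) and models the probability that a human prefers \(y_{1}\) over \(y_{2}\):

\begin{equation} P(y_{1} \succ y_{2} \mid x) = \sigma\qty(r_{\theta}(x, y_{1}) - r_{\theta}(x, y_{2})) \end{equation}

where \(\sigma\) is the sigmoid; the reward model is trained to maximize the log-likelihood of the observed human preferences.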
SU-CS224N MAY072024
Benchmark tradeoffs
- baseline too high: no one can beat it
- baseline too low: no differentiation
Close-ended evaluation
- do standard ML (“accuracy”)
- because there’s one of a few known answers
- types of tasks: SST, IMDb, Yelp; SNLI
Most common multi-task benchmark: SuperGLUE
Difficulties
- what metrics do you choose?
- how to aggregate across metrics (average?)
- label statistics
- spurious correlations
Open-ended evaluations
- long generations with too many correct answers (can’t directly apply classic ML)
- there are better and worse answers (relative)
Content Overlap Metrics
compare lexical similarity between generated and gold text (e.g., \(n\)-gram overlap metrics like BLEU and ROUGE):
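The core of these metrics is clipped \(n\)-gram precision (a minimal sketch of BLEU’s per-\(n\) term; `ngram_precision` is an illustrative name, not a library function):

```python
from collections import Counter

def ngram_precision(generated, reference, n=1):
    """Clipped n-gram precision of generated tokens against a reference."""
    gen = list(zip(*[generated[i:] for i in range(n)]))  # n-grams of the generation
    ref = list(zip(*[reference[i:] for i in range(n)]))  # n-grams of the gold text
    gen_counts, ref_counts = Counter(gen), Counter(ref)
    # Clip each n-gram's credit at its count in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    return overlap / max(len(gen), 1)
```

E.g., `ngram_precision(["the", "cat", "sat"], ["the", "cat", "ate"])` gives 2/3 — lexical overlap, regardless of whether the generation means the same thing.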
SU-CS224N MAY092024
Floating Point
4 bytes
\begin{equation} (-1)^{S} \times 2^{E-127} \times \qty(1 + \sum_{i=1}^{23} b_{23-i}2^{-i}) \end{equation}
usually \(E\) is 8 bits and there are 23 bits of \(b\) (plus 1 sign bit \(S\)).
With more \(E\), we will have more range, with more \(b\), we will have more precision.
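The layout above can be checked directly by unpacking the bits of an FP32 value (a small stdlib sketch; `fp32_fields` is an illustrative name):

```python
import struct

def fp32_fields(x: float):
    """Decode an FP32 number into its sign bit S, exponent field E, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    S = bits >> 31             # 1 sign bit
    E = (bits >> 23) & 0xFF    # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF # 23 fraction bits b
    # For normal numbers: value = (-1)**S * 2**(E - 127) * (1 + mantissa / 2**23)
    return S, E, mantissa
```

E.g., `fp32_fields(1.0)` gives `(0, 127, 0)`: sign 0, exponent \(127 - 127 = 0\), fraction 0.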
Mixed Precision Training
- copy the model in FP32
- Run forward pass in FP16
- Scale loss to be large enough to not be rounded away
- Compute gradients in FP16
- Convert the gradients onto FP32
- Scale the gradients down
- apply to the model
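Why the scaling steps matter can be shown with a toy numpy example (hypothetical numbers, not a real training loop): a small FP32 gradient underflows to zero in FP16 unless the loss — and hence the gradient — is scaled up first.

```python
import numpy as np

tiny_grad = np.float32(1e-8)              # a plausible small gradient value
assert np.float16(tiny_grad) == 0.0       # cast to FP16: rounded away to zero

scale = 1024.0                            # loss-scale factor (illustrative)
scaled_fp16 = np.float16(tiny_grad * scale)  # scaled value survives in FP16
recovered = np.float32(scaled_fp16) / scale  # convert to FP32, scale back down
```

`recovered` is close to the original `1e-8`, whereas the unscaled FP16 cast lost it entirely.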
BFloat16
To avoid the need for scaling, we can use a scheme that has less precision but the same dynamic range as FP32 (i.e., allocate the same \(E\), chop bits off \(b\)): no loss scaling is needed, because the dynamic range already covers small gradients.
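Since bfloat16 is just FP32 with the low 16 mantissa bits dropped, it can be simulated with stdlib bit tricks (a sketch; `to_bfloat16` is an illustrative name, and this truncates rather than rounds):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round-trip a float through simulated bfloat16: keep FP32's sign and
    8 exponent bits, truncate the mantissa from 23 bits down to 7."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]
```

E.g., `to_bfloat16(1e-8)` stays nonzero (same exponent range as FP32, so no underflow), while fine mantissa detail like `1 + 2**-10` is truncated away — range preserved, precision sacrificed.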
