
SU-CS224N APR252024

Last edited: August 8, 2025

Transformers

Motivation

Lower Sequence-Length Time Complexity

Minimize Linear Interaction Distance

The interaction distance between two tokens scales as \(O(l)\), with \(l\) the sequence length: gradients must flow through every intermediate step, so linear order is baked in.

Maximize Parallelization

Forward and backward passes require waiting for the computation to roll from left to right; attention can instead be computed over all positions in parallel.

Key Advantage

  1. Maximum interaction distance is \(O(1)\) — each word is connected to each other word
  2. The number of unparallelizable operations does not grow with sequence length

Self-Attention

Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
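A minimal single-head sketch of this in NumPy (function and variable names are illustrative, not from the lecture): every token's query is scored against every token's key, so all pairwise interactions happen in one matrix multiply.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every position attends to every position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_len, seq_len): all-pairs interactions
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))             # 5 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Because `scores` covers every token pair at once, the maximum interaction distance is \(O(1)\) and the whole layer parallelizes over positions.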

SU-CS224N APR302024


Subword

We use SUBWORD modeling to deal with:

  1. combinatorial morphology (resolving word form and infinitives) — “a single word has a million forms in Finnish” (“transformify”)
  2. misspelling
  3. extensions/emphasis (“gooooood vibessssss”)

You mark each actual word ending with some kind of combining marker.

To fix this:

Byte-Pair Encoding

“find pieces of words that are common and treat them as a vocabulary”

  1. start with vocab containing only characters and EOS
  2. look at the corpus, and find the most common pair of adjacent characters
  3. replace all instances of the pair with the new subword
  4. repeat 2-3 until the vocab size is big enough
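The steps above can be sketched directly (a toy implementation; the `</w>` end-of-word marker and the helper name `bpe_merges` are illustrative):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Step 1: each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace all instances of the pair with the new subword.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], 3))
```

On this tiny corpus the first merges fuse the shared stem (e.g. `l`+`o`, then `lo`+`w`), which is exactly the "common pieces of words become vocabulary" idea.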

Writing Systems

  • phonemic (directly translating sounds, see Spanish)
  • fossilized phonemic (English, where sounds are whack)
  • syllabic/moraic (each syllable or mora written down)
  • ideographic (like syllabic, but symbols carry meaning rather than sound)
  • a combination of the above (Japanese)

Whole-Model Pretraining

  • all parameters are initialized via pretraining
  • don’t even bother training word vectors

MLM and NTP are “Universal Tasks”

Because, in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.

SU-CS224N MAY022024


Zero-Shot Learning

GPT-2 is able to do many tasks with no examples + no gradient updates.

Instruction Fine-Tuning

Language models, by default, are not aligned with user intent.

  1. collect paired examples of instruction + output across many tasks
  2. then, evaluate on unseen tasks

~3 million examples << n billion examples

dataset: MMLU

You can generate an Instruction Fine-Tuning dataset by asking a larger model for it (see Alpaca).

Pros + Cons

  • simple and straightforward + generalize to unseen tasks
  • but, it's EXPENSIVE to collect ground-truth data
    • ground truths may be wrong
    • creative tasks may not have a correct answer
    • LMs penalize all token-level mistakes equally, but some mistakes are worse than others
    • humans may generate suboptimal answers

Human Preference Modeling

Imagine if we have some input \(x\), and two output trajectories, \(y_{1}\) and \(y_{2}\).
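The notes stop at the setup; a standard way to model the comparison (the Bradley–Terry formulation used in RLHF reward modeling, sketched here as an assumption about where the lecture goes next) is to give each trajectory a scalar reward and pass the difference through a sigmoid:

```python
import math

def preference_prob(r1, r2):
    """Bradley-Terry model: probability a human prefers y1 over y2,
    given scalar reward scores r1 = r(x, y1) and r2 = r(x, y2)."""
    return 1 / (1 + math.exp(-(r1 - r2)))

# Equal rewards mean a coin flip; a higher reward makes y1 more likely preferred.
print(preference_prob(0.0, 0.0))  # 0.5
print(round(preference_prob(2.0, 0.0), 3))  # 0.881
```

A reward model is then trained to maximize the log-probability of the human-preferred trajectory under this model.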

SU-CS224N MAY072024


Benchmark tradeoffs

  • baseline too high: no one can beat it
  • baseline too low: no differentiation

Close-ended evaluation

  • do standard ML (“accuracy”)
  • because there’s one of a few known answers
  • types of tasks: SST, IMDb, Yelp; SNLI

Most common multi-task benchmark: SuperGLUE

Difficult

  • what metrics do you choose?
  • how to aggregate across metrics (average?)
  • label statistics
  • spurious correlations

Open-ended evaluations

  • long generations with too many correct answers (can’t directly apply classic ML)
  • there are better and worse answers (relative)

Content Overlap Metrics

compare lexical similarity between generated and gold text (n-gram overlap, as in BLEU or ROUGE).
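A minimal example of such a metric (unigram F1 with BLEU-style clipped counts; the function name is illustrative, and real BLEU/ROUGE add n-grams, brevity penalties, etc.):

```python
from collections import Counter

def unigram_f1(generated, reference):
    """Lexical-overlap score: harmonic mean of unigram precision and recall."""
    gen, ref = Counter(generated.split()), Counter(reference.split())
    overlap = sum((gen & ref).values())   # clipped counts: min of the two totals
    if overlap == 0:
        return 0.0
    p = overlap / sum(gen.values())       # fraction of generated tokens matched
    r = overlap / sum(ref.values())       # fraction of reference tokens matched
    return 2 * p * r / (p + r)

print(unigram_f1("the cat sat on the mat", "the cat is on the mat"))  # 0.8333...
```

This also shows the weakness of content overlap: a generation with the right words in the wrong order, or a correct paraphrase with different words, is scored badly.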

SU-CS224N MAY092024


Floating Point

4 bytes

\begin{equation} (-1)^{B} \times 2^{E-127} \times \qty(1 + \sum_{i=1}^{23} b_{23-i}2^{-i}) \end{equation}

usually \(E\) is 8 bits, and there are 23 bits of \(b\).

With more \(E\), we will have more range, with more \(b\), we will have more precision.
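The sign/exponent/fraction fields can be read off a real FP32 value with the standard library (a small sketch; `fp32_fields` is an illustrative helper name):

```python
import struct

def fp32_fields(x):
    """Split an FP32 value into sign bit B, exponent field E, and fraction bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, fraction

# 1.5 = (-1)^0 * 2^(127-127) * (1 + 2^-1): exponent field 127, top fraction bit set.
print(fp32_fields(1.5))  # (0, 127, 4194304), and 4194304 == 1 << 22
```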

Mixed Precision Training

  1. keep a copy of the model in FP32
  2. run the forward pass in FP16
  3. scale the loss to be large enough that gradients are not rounded away
  4. compute gradients in FP16
  5. convert the gradients to FP32
  6. scale the gradients back down
  7. apply them to the FP32 model
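The loss-scaling steps can be demonstrated numerically with NumPy (a toy illustration of why step 3 exists; the scale value 2^16 is an arbitrary choice):

```python
import numpy as np

true_grad = 1e-8                        # a tiny gradient, as from a deep layer
print(np.float16(true_grad))            # 0.0: underflows when cast to FP16

scale = 2.0 ** 16                       # loss scale (step 3)
scaled = np.float16(true_grad * scale)  # now representable in FP16 (step 4)
recovered = np.float32(scaled) / scale  # convert to FP32, then unscale (steps 5-6)
print(recovered)                        # ~1e-8: the gradient survives
```

Scaling the loss multiplies every gradient by the same constant, pushing them up into FP16's representable range; dividing again in FP32 recovers the true values.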

BFloat16

To avoid loss scaling entirely, we can use a scheme that has less precision but the same dynamic range as FP32 (i.e. allocate the same 8 bits to \(E\), and chop bits off \(b\)). With the full dynamic range, small gradients no longer underflow, so no scaling is needed.
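The range claim can be checked from the bit allocations alone (a small calculation; `max_normal` is an illustrative helper):

```python
# Largest normal value for a binary float with e exponent bits and m fraction
# bits: (2 - 2^-m) * 2^emax, where emax = 2^(e-1) - 1 is the largest exponent.
def max_normal(e_bits, m_bits):
    emax = 2 ** (e_bits - 1) - 1
    return (2 - 2 ** -m_bits) * 2.0 ** emax

for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    print(f"{name}: ~{max_normal(e, m):.3g}")
```

BF16 keeps FP32's 8 exponent bits, so its range tops out near FP32's ~3.4e38, while FP16's 5 exponent bits cap it at 65504; BF16 pays for this with only 7 fraction bits of precision.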