SU-CS224N APR232024
Evaluating Machine Translation
BLEU
Compare the machine translation against (possibly multiple) human reference translations. Uses a geometric mean of n-gram precisions; the exact n-gram size isn't critical.
The original idea was to use multiple reference translations, but in practice people often use only one: a good translation should still get a good score in expectation.
Limitations
- a good translation can get a bad BLEU score because it has low n-gram overlap with the references
- a brevity penalty is applied to too-short system translations, so translating only the easy sentences doesn't score well (see the formula below)
- you can't realistically reach 100 BLEU because of natural variation in text
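Putting these pieces together, the standard formulation: BLEU is the geometric mean of modified n-gram precisions \(p_{n}\) (typically up to \(N = 4\), with uniform weights \(w_{n} = 1/N\)) times the brevity penalty \(BP\):

\[
\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_{n} \log p_{n} \right), \qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}
\]

where \(c\) is the candidate translation's length and \(r\) is the reference length.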
Attention
Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, with the weights depending on the query.
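In symbols, in the minimal dot-product form where the values serve as their own keys: given values \(v_{1}, \dots, v_{n}\) and a query \(q\),

\[
\text{output} = \sum_{i=1}^{n} \alpha_{i} v_{i}, \qquad
\alpha_{i} = \frac{\exp(q^{\top} v_{i})}{\sum_{j=1}^{n} \exp(q^{\top} v_{j})}
\]

so the softmax weights \(\alpha_{i}\) form the attention distribution over the values.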
SU-CS224N APR252024
Transformers
Motivation
Lower Sequence-Length Time Complexity
Minimize Linear Interaction Distance
In a recurrent model, interaction distance scales as \(O(l)\) with sequence length \(l\): gradients must flow through every intermediate state, so linear order is baked into the model.
Maximize Parallelization
In a recurrent model, forward and backward passes require waiting for the computation to roll from left to right; attention can instead be computed for all positions in parallel.
Key Advantage
- Maximum interaction distance is \(O(1)\) — each word is connected to each other word
- Unparallelizable operations do not increase with sequence length
Self-Attention
Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
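A minimal sketch of single-head, unbatched self-attention in NumPy (the dimensions and random initialization are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Every position of X attends to every position of the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries/keys/values, each (l, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (l, l): all pairwise interactions
    return softmax(scores, axis=-1) @ V       # weighted sums of the values

rng = np.random.default_rng(0)
l, d = 5, 8                                   # toy sequence length and model dim
X = rng.normal(size=(l, d))                   # stand-in for word embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)        # (5, 8): one output per word
```

The \((l, l)\) score matrix computes every pairwise interaction at once, which is where the \(O(1)\) interaction distance and the parallelism come from.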
SU-CS224N APR302024
Subword
We use SUBWORD modeling to deal with:
- combinatorial morphology (word forms and inflections): "a single word has a million forms in Finnish"; novel coinages like "transformify"
- misspellings
- extensions/emphasis (“gooooood vibessssss”)
You mark each actual word ending with some kind of combining marker (e.g. an end-of-word symbol like "</w>").
To fix this:
Byte-Pair Encoding
“find pieces of words that are common and treat them as a vocabulary”
1. start with a vocab containing only characters and an end-of-word symbol (EOS)
2. look at the corpus and find the most common pair of adjacent tokens
3. replace all instances of that pair with a new subword token
4. repeat steps 2-3 until the vocab size is big enough (see the sketch after this list)
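A minimal sketch of the merge loop (following the classic Sennrich et al. reference implementation; the toy corpus and number of merges are made up):

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words pre-split into characters, with "</w>" marking the word ending
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                   # number of merges controls final vocab size
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most common adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```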
Writing Systems
- phonemic (directly translating sounds, see Spanish)
- fossilized phonemic (English, where sounds are whack)
- syllabic/moraic (each syllable or mora gets a written symbol)
- ideographic (character-sized units, but tied to meaning rather than sound)
- a combination of the above (Japanese)
Whole-Model Pretraining
- all parameters are initialized via pretraining
- don’t even bother training word vectors
MLM and NTP are “Universal Tasks”
Because, in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.
SU-CS224N MAY022024
Zero-Shot Learning
GPT-2 is able to do many tasks with no examples and no gradient updates.
Instruction Fine-Tuning
Language models, by default, are not aligned with user intent.
- collect paired examples of instruction + output across many tasks
- then, evaluate on unseen tasks
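As a sketch, one such paired example might be serialized like this before fine-tuning (the template and field names here are hypothetical, loosely Alpaca-style):

```python
# Hypothetical (Alpaca-style) serialization of one instruction + output pair.
TEMPLATE = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

example = {
    "instruction": "Translate to French: 'The cat sleeps.'",
    "output": "Le chat dort.",
}
text = TEMPLATE.format(**example)  # train with the ordinary LM loss on this string
```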
~3 million instruction-tuning examples << the billions of examples seen in pretraining
evaluation dataset: MMLU
You can generate an Instruction Fine-Tuning dataset by asking a larger model for it (see Alpaca).
Pros + Cons
- simple and straightforward + generalize to unseen tasks
- but, it's EXPENSIVE to collect ground-truth data
- the ground truth may be wrong
- creative tasks may not have a correct answer
- LMs penalize all token-level mistakes equally, but some mistakes are worse than others
- humans may generate suboptimal answers
Human Preference Modeling
Imagine we have some input \(x\) and two output trajectories, \(y_{1}\) and \(y_{2}\).
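The notes cut off here; in the standard setup (RLHF-style reward modeling under a Bradley-Terry model, which I'm assuming is where the lecture was headed), a human labels which trajectory they prefer, and we fit a reward model \(R\) so that

\[
P(y_{1} \succ y_{2} \mid x) = \frac{\exp\big(R(x, y_{1})\big)}{\exp\big(R(x, y_{1})\big) + \exp\big(R(x, y_{2})\big)}
\]

matches the observed human preferences.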
SU-CS224N MAY072024
Benchmark tradeoffs
- baseline too high: no one can beat it
- baseline too low: no differentiation between systems
Close-ended evaluation
- do standard ML (“accuracy”)
- because the answer is one of a few known options
- types of tasks: SST, IMDB, Yelp (sentiment); SNLI (natural language inference)
Most common multi-task benchmark: SuperGLUE
Difficult
- what metrics do you choose?
- how to aggregate across metrics (average?)
- label statistics
- spurious correlations
Open-ended evaluations
- long generations with too many possible correct answers (classic ML metrics can't be applied directly)
- there are better and worse answers (relative)
Content Overlap Metrics
Compare lexical similarity between generated and gold text, e.g. via n-gram overlap (BLEU, ROUGE):
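A minimal sketch of the core computation behind such metrics, a clipped n-gram precision (BLEU-style); the helper names are my own:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(generated, reference, n=2):
    """Fraction of generated n-grams that also appear in the reference (clipped)."""
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in gen.items())
    total = sum(gen.values())
    return overlap / total if total else 0.0

print(ngram_precision("the cat sat on the mat".split(),
                      "the cat is on the mat".split()))  # -> 0.6
```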