
SU-CS224N APR232024

Last edited: August 8, 2025

Evaluating Machine Translation

BLEU

Compare the machine translation against multiple human reference translations. Uses a geometric mean of n-gram precisions; the exact n-gram size isn't especially important.

The original idea was to use multiple reference translations, but in practice people often collect only one reference translation; the score is still good in expectation.
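A rough sketch of the computation (geometric mean of 1-4-gram precisions, times a brevity penalty); in practice you would use a library like sacrebleu, and the helper names here are illustrative:

```python
# Toy BLEU for a single candidate; assumes pre-tokenized text.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each n-gram's count at its max count across the references
        ref_max = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                ref_max[g] = max(ref_max[g], c)
        overlap = sum(min(c, ref_max[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:   # any zero n-gram overlap zeroes the score
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish candidates shorter than the closest reference
    ref_len = min((len(r) for r in references),
                  key=lambda length: abs(length - len(candidate)))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(),
           ["the cat sat on a mat".split()]))  # ~0.54
```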

Limitations

  • a good translation can get a bad BLEU score because it has low n-gram overlap with the references
  • there is a penalty for too-short system translations (i.e., you can't score well by translating only the easy sentences)
  • you really can't get to 100 BLEU because of natural variation in text

attention

Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, with the weights depending on the query.
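A minimal sketch in NumPy (shapes and names are illustrative; the values double as keys here):

```python
import numpy as np

def attention(query, values):
    # query:  (d,)   a single query vector
    # values: (n, d) n value vectors (also used as keys here)
    scores = values @ query                  # (n,) dot-product similarity
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # (d,) weighted sum of values

query = np.array([1.0, 0.0])
values = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]])
print(attention(query, values))  # pulled toward values similar to the query
```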

SU-CS224N APR252024

Last edited: August 8, 2025

Transformers

Motivation

Lower Sequence-Length Time Complexity

Minimize Linear Interaction Distance

In a recurrent model, interaction distance scales as \(O(l)\) with sequence length \(l\): the gradient has to traverse that linear distance, so linear order is baked in.

Maximize Parallelization

Forward and backward passes require waiting (computation rolls from left to right); attention, by contrast, can be computed in parallel across the whole sequence.

Key Advantage

  1. Maximum interaction distance is \(O(1)\): each word is connected to every other word
  2. The number of unparallelizable operations does not grow with sequence length

Self-Attention

Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
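A single-head sketch in NumPy; the projection matrices and shapes are illustrative assumptions (no masking, no multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (l, d) -- every position attends to every position of the same X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (l, l) all-pairs scores
    return softmax(scores, axis=-1) @ V        # (l, d) weighted sums

rng = np.random.default_rng(0)
l, d = 4, 8
X = rng.normal(size=(l, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8)
```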

SU-CS224N APR302024

Last edited: August 8, 2025

Subword

We use SUBWORD modeling to deal with:

  1. combinatorial morphology (word forms and inflections): “a single word has a million forms in Finnish” (“transformify”)
  2. misspellings
  3. extensions/emphasis (“gooooood vibessssss”)

You mark each actual word ending with some kind of combining marker (e.g., an end-of-word symbol).

To fix this:

Byte-Pair Encoding

“find pieces of words that are common and treat them as a vocabulary”

  1. start with a vocab containing only characters and an end-of-word symbol
  2. look at the corpus and find the most common pair of adjacent symbols
  3. merge all instances of the pair into a new subword and add it to the vocab
  4. repeat 2-3 until the vocab is big enough (sketched below)
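A toy sketch of that loop over a word-frequency dictionary; the end-of-word marker and example follow the common Sennrich et al. formulation, and all names are illustrative:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # each word is a tuple of symbols, ending with an end-of-word marker
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # step 2: count every pair of adjacent symbols in the corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # step 3: merge all instances of the best pair into one symbol
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```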

Writing Systems

  • phonemic (directly transcribing sounds, see Spanish)
  • fossilized phonemic (English, where spelling no longer tracks sound)
  • syllabic/moraic (each syllable or mora written down)
  • ideographic (like syllabic in form, but symbols carry meaning rather than sound)
  • a combination of the above (Japanese)

Whole-Model Pretraining

  • all parameters are initialized via pretraining
  • don’t even bother training word vectors

MLM and NTP are “Universal Tasks”

Because in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.

SU-CS224N MAY022024

Last edited: August 8, 2025

Zero-Shot Learning

GPT-2 is able to do many tasks with no examples and no gradient updates.

Instruction Fine-Tuning

Language models, by default, are not aligned with user intent.

  1. collect paired examples of instruction + output across many tasks
  2. then, evaluate on unseen tasks

~3 million fine-tuning examples << the n billion examples seen in pretraining

evaluation dataset: MMLU

You can generate an Instruction Fine-Tuning dataset by asking a larger model for it (see Alpaca).
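A schematic sketch of the objective, assuming a HuggingFace-style causal LM whose forward pass returns `.logits` (all other names here are illustrative): standard next-token cross-entropy over the concatenated instruction + output, masked so only the output tokens are penalized.

```python
import torch
import torch.nn.functional as F

def instruction_ft_loss(model, instr_ids, output_ids):
    # instr_ids, output_ids: 1-D LongTensors of token ids
    ids = torch.cat([instr_ids, output_ids]).unsqueeze(0)  # (1, T)
    logits = model(ids).logits                             # (1, T, vocab)
    shift_logits = logits[0, :-1]      # predict token t+1 from prefix
    shift_labels = ids[0, 1:].clone()
    # mask loss on the instruction part; only the output is supervised
    shift_labels[: len(instr_ids) - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```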

Pros + Cons

  • simple and straightforward + generalizes to unseen tasks
  • but, it's EXPENSIVE to collect ground-truth data
    • ground truths may be wrong
    • creative tasks may not have a single correct answer
    • LMs penalize all token-level mistakes equally, but some mistakes are worse than others
    • humans may generate suboptimal answers

Human Preference Modeling

Imagine we have some input \(x\) and two output trajectories, \(y_{1}\) and \(y_{2}\); instead of writing a gold answer, a human simply labels which trajectory they prefer.
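One standard way to use such pairs (an assumption here, in the spirit of RLHF) is a Bradley-Terry-style reward model: learn a scalar reward \(r(x, y)\) so that \(P(y_{1} \succ y_{2}) = \sigma(r(x, y_{1}) - r(x, y_{2}))\), fit to the human choices. A minimal PyTorch sketch of the loss, with `r_w`/`r_l` standing in for rewards of the preferred and rejected trajectories:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_preferred, r_rejected):
    # -log sigmoid(r_w - r_l): push the preferred trajectory's
    # reward above the rejected one's
    return -F.logsigmoid(r_preferred - r_rejected).mean()

r_w = torch.tensor([1.2, 0.3])  # rewards of human-preferred outputs
r_l = torch.tensor([0.4, 0.9])  # rewards of rejected outputs
print(preference_loss(r_w, r_l))
```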

SU-CS224N MAY072024

Last edited: August 8, 2025

Benchmark tradeoffs

  • baseline too high: no one can beat it
  • baseline too low: no differentiation between systems

Close-ended evaluation

  • do standard ML evaluation (“accuracy”)
  • because the answer is one of a few known options
  • types of tasks: SST, IMDB, Yelp; SNLI

Most common multi-task benchmark: SuperGLUE

Difficulties

  • what metrics do you choose?
  • how to aggregate across metrics (average?)
  • label statistics
  • spurious correlations

Open-ended evaluations

  • long generations with too many correct answers (can’t directly apply classic ML)
  • there are better and worse answers (relative)

Content Overlap Metrics

compare lexical similarity between generated and gold text: