Houjun Liu

SU-CS224N APR302024


We use SUBWORD modeling to deal with:

  1. combinatorial morphology (resolving word form and infinitives) — “a single word has a million forms in Finnish” (“transformify”)
  2. misspelling
  3. extensions/emphasis (“gooooood vibessssss”)

You mark each actual word ending with some kind of combine marker, so that subwords can be joined back into words.

To fix this:

Byte-Pair Encoding

“find pieces of words that are common and treat them as a vocabulary”

  1. start with vocab containing only characters and an end-of-word marker
  2. look at the corpus, and find the most common pair of adjacent symbols (characters at first, merged subwords later)
  3. add the merged pair to the vocab as a new subword, and replace all instances of the pair with it
  4. repeat 2-3 until vocab size is big enough
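
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the `</w>` end-of-word marker, the function names `merge_pair` and `train_bpe`, and the toy corpus are all assumptions made for the example, and ties between equally common pairs are broken by whichever pair was seen first.

```python
from collections import Counter

EOW = "</w>"  # hypothetical end-of-word marker; any symbol absent from the corpus works


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged symbol."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, vocab_size):
    """Learn BPE merges from a list of words until the vocab reaches `vocab_size`."""
    # Step 1: every word is a tuple of characters plus the end-of-word marker.
    words = Counter(tuple(w) + (EOW,) for w in corpus)
    vocab = {sym for word in words for sym in word}
    merges = []
    while len(vocab) < vocab_size:
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        # Step 3: add the new subword and rewrite the corpus with it.
        vocab.add(best[0] + best[1])
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return vocab, merges


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab, merges = train_bpe(corpus, vocab_size=14)
print(merges)
```

On this toy corpus the first merges build up the common suffix "est" plus the end-of-word marker, which is the kind of morphological piece the subword vocabulary is meant to capture.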

Writing Systems

  • phonemic (directly translating sounds, see Spanish)
  • fossilized phonemic (English, where sounds are whack)
  • syllabic/moratic (each sound syllable written down)
  • ideographic (like syllabic, but symbols have no relation to sound and instead carry meaning)
  • a combination of the above (Japanese)

Whole-Model Pretraining

  • all parameters are initialized via pretraining
  • don’t even bother training word vectors

MLM (masked language modeling) and NTP (next-token prediction) are “Universal Tasks”

Because in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.

Why Pretraining

  • maybe local minima near pretraining weights generalize well
  • or maybe, because the outputs are already sensible, the gradients are well modulated and propagate nicely

Types of Architecture

Encoders

  • bidirectional context
  • can condition on the future


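BERT pretraining (masked LM): pick a random subset of input words to predict (15% in the original BERT), then for each picked word:

  1. replace input word with [mask] 80% of the time
  2. replace input word with a RANDOM WORD 10% of the time
  3. leave the word unchanged 10% of the time

i.e. BERT will then need to resolve a proper sentence representation from lots of noise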
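
The 80/10/10 corruption rule can be sketched as follows. This is a hedged illustration rather than BERT's actual preprocessing code: the function name `bert_mask`, the `select_prob` parameter (defaulting to the 15% used in the original BERT), and the toy vocabulary are assumptions for the example.

```python
import random

MASK = "[MASK]"


def bert_mask(tokens, vocab, select_prob=0.15, rng=random):
    """Return (inputs, targets); targets[i] is None where no prediction is required."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: pass the token through, nothing to predict here.
            inputs.append(tok)
            targets.append(None)
            continue
        targets.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            inputs.append(tok)                # 10%: leave unchanged
    return inputs, targets


toy_vocab = ["party", "week", "thank", "you", "last"]  # hypothetical vocabulary
sentence = "thank you for inviting me to your party last week".split()
print(bert_mask(sentence, toy_vocab, rng=random.Random(0)))
```

Because selected tokens are sometimes left unchanged or swapped for a random word, the model cannot assume that only `[MASK]` positions need a good representation.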

Original BERT was also pretrained with a next-sentence prediction loss in addition to MLM, but that ended up being unnecessary.

  • BERT-ish variants

    1. RoBERTa - train BERT for longer and drop next-sentence prediction
    2. SpanBERT - mask contiguous spans of words


Encoder-Decoders

  • do both (bidirectional encoder + autoregressive decoder)
  • pretraining may be hard


Encoder/Decoder model. Pretraining task: blank inversion (span corruption, as in T5):

“Thank you for inviting me to your party last week”

“Thank you <x> to your <y> last week” => “<x> for inviting me <y> party <z>”

This actually is BETTER than the LM training objective.
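
The example above can be reproduced mechanically. The sketch below assumes the spans to blank out are given as token index pairs (in real pretraining they would be sampled at random); the name `span_corrupt` and the sentinel list are hypothetical, chosen to match the `<x>`, `<y>`, `<z>` notation used here.

```python
SENTINELS = ["<x>", "<y>", "<z>", "<u>", "<v>", "<w>"]  # hypothetical sentinel tokens


def span_corrupt(tokens, spans):
    """Blank out `spans` (sorted, non-overlapping (start, end) pairs, end exclusive).

    Returns (inputs, targets): inputs has each span replaced by a sentinel,
    targets lists each sentinel followed by the tokens it hides.
    """
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(SENTINELS[i])          # the span collapses to one sentinel
        targets.append(SENTINELS[i])
        targets.extend(tokens[start:end])    # the decoder must produce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(SENTINELS[len(spans)])    # final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 5), (7, 8)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder reads the corrupted input with full bidirectional context, while the decoder generates only the missing spans, so the targets are much shorter than the original sentence.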


Decoders

  • general LMs use this
  • nice to generate from, but cannot condition on future words
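
The difference between bidirectional context and "cannot condition on future words" can be pictured as an attention mask. The sketch below is a plain-Python illustration (no deep-learning library involved); `attention_mask` and `show` are hypothetical helper names.

```python
def attention_mask(n, causal):
    """Return an n x n grid where mask[i][j] is True if position i may attend to j."""
    return [[(not causal) or j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render the mask with 'x' for allowed and '.' for blocked attention."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print(show(attention_mask(4, causal=False)))  # bidirectional: everything visible
print()
print(show(attention_mask(4, causal=True)))   # causal: lower triangle only
```

With the causal mask, each position sees only itself and earlier positions, which is what makes left-to-right generation straightforward.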

In-Context Learning

  • really only very capable at hundreds of billions of parameters
  • uses no gradient steps: just repeat examples in the prompt and attend to them
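
Since no weights are updated, "learning" here amounts to building a prompt. A minimal sketch of assembling a few-shot prompt follows; the `Input:`/`Output:` template and the function name `few_shot_prompt` are assumptions for illustration, and no model call is shown.

```python
def few_shot_prompt(examples, query, sep="\n\n"):
    """Concatenate solved examples, then the unsolved query for the model to complete."""
    shots = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    shots.append(f"Input: {query}\nOutput:")
    return sep.join(shots)


examples = [("gooooood", "good"), ("vibessssss", "vibes")]
prompt = few_shot_prompt(examples, "thaaaanks")
print(prompt)
```

The model would be asked to continue this text; everything it knows about the task comes from attending to the examples earlier in the prompt.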