Houjun Liu

tokenization

Every NLP task involves some kind of text normalization.

  1. tokenizing words
  2. normalizing word formats (e.g. lemmatization)
  3. sentence and paragraph segmentation

For Latin, Arabic, Cyrillic, and Greek writing systems, spaces can usually be used for tokenization. Other writing systems (Chinese, for instance) can't be segmented this way. See morpheme

Subword Tokenization

Algorithms for breaking text into tokens below the word level, using corpus statistics.

  • BPE
  • Unigram Language Modeling tokenization
  • WordPiece

They all work in two parts:

  • a token learner: takes a training corpus and derives a vocabulary of tokens
  • a token segmenter: tokenizes text according to that vocabulary
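For instance, here is a minimal sketch of a BPE token learner (the toy corpus, the number of merges, and the function name are illustrative assumptions, not any particular library's API):

from collections import Counter

def learn_bpe(corpus, num_merges):
    # token learner: start with each word as characters plus an end-of-word marker
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("_",)] += 1
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes one new symbol
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe("low low low lower lower newest newest newest widest", 10)

The token segmenter side then just replays these merges, in the learned order, on each word of new text.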

tr

For space-separated languages, you can use simple Unix tools like tr to perform a rough tokenization.

tr -sc "A-Za-z" "\n" < input.txt

This replaces every character that is not a letter (-c takes the complement of the set) with a newline; -s squeezes runs of newlines into a single newline.

This turns the text into one word per line.

Sorting it (because uniq requires duplicates to be adjacent) and piping into uniq -c gives a count for each word:

tr -sc "A-Za-z" "\n" < input.txt | sort | uniq -c

We can then do a reverse numerical sort:

tr -sc "A-Za-z" "\n" < input.txt | sort | uniq -c | sort -r -n

which gives the words listed by descending frequency.

This is a BAD RESULT most of the time: some tokens contain punctuation that carries meaning and shouldn't simply be stripped during tokenization: m.p.h., or AT&T, or John's, or 1/1/12.
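One hedged sketch of how a regular-expression tokenizer could keep such tokens intact (the pattern and the example sentence are illustrative assumptions, not the standard pattern from any toolkit):

import re

# keep abbreviations (m.p.h.), internal apostrophes/ampersands (John's, AT&T),
# and slash-separated dates (1/1/12) as single tokens
pattern = r"(?:[A-Za-z]\.)+|[A-Za-z]+(?:['&][A-Za-z]+)*|\d+(?:/\d+)*"
text = "The m.p.h. sign at AT&T near John's office, dated 1/1/12."
print(re.findall(pattern, text))
# ['The', 'm.p.h.', 'sign', 'at', 'AT&T', 'near', "John's", 'office', 'dated', '1/1/12']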

What to Tokenize

“I do uh main- mainly business data processing”

  • uh: filled pause
  • main-: fragment

Consider:

“Seuss’s cat in the hat is different from other cats!”

  • cat and cats: same lemma (i.e. stem + part of speech + word sense)
  • cat and cats: different wordforms

We usually count tokens as instances of wordforms, duplicates included; word types are the unique, distinct wordforms.
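A quick illustrative count of the difference (the sentence and the whitespace split are assumptions, standing in for a real tokenizer):

sentence = "the cat sat on the mat with the other cat"
tokens = sentence.split()       # wordform instances, duplicates counted
types = set(tokens)             # unique wordforms only
print(len(tokens), len(types))  # 10 tokens, 7 types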

clitics

John's: contains the clitic 's, a unit that doesn’t stand on its own but attaches to another word.