Houjun Liu

corpus

usually we use \(N\) to denote the number of tokens, and \(V\) the “vocab” or set of word types.

Corpora is usually considered in context of:

  • specific writers
  • at specific time
  • for specific varieties
  • of specific languages
  • for a specific function

Particularly hard: code switching, gender, demographics, variety, etc.

Herdan’s Law

\begin{equation} |V| = kN^{\beta} \end{equation}

with \(\beta\) being a constant between \(0.67 < \beta < 0.75\).

The vocab size is roughly proportional to the number of tokens.