Houjun Liu

Language Model

An LM predicts this distribution (“what is the distribution of the next word, given the previous words?”):

\begin{equation} W^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}

By applying the chain rule, we can also think of the language model as assigning a probability to a sequence of words:

\begin{align} P(S) &= P(w^{(t)} | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \cdot P(w^{(t-1)} | w^{(t-2)}, \dots, w^{(1)}) \dots \\ &= P(w^{(t)}, w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{align}

N-Gram LM

We can use the Markov assumption: that a word depends only on the previous \(n-1\) words.

By doing this, we can write:

\begin{align} P(x_{t} | x_{t-1}, \dots, x_{0}) &\approx P(x_{t} | x_{t-1}, \dots, x_{t-n+1}) \\ &= \frac{P(x_{t}, x_{t-1}, \dots, x_{t-n+1})}{P(x_{t-1}, \dots, x_{t-n+1})} \end{align}

and we can then estimate both probabilities directly from counts in a corpus.
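The count-based estimate can be sketched directly, e.g. for a bigram model (\(n=2\)); the toy corpus here is hypothetical:

```python
from collections import Counter

def bigram_prob(corpus, w_prev, w):
    """Maximum-likelihood bigram estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
    unigrams = Counter(corpus[:-1])             # counts of contexts (last word can't be one)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = ["the", "cat", "sat", "on", "the", "mat"]
print(bigram_prob(corpus, "the", "cat"))  # "the" occurs twice, once followed by "cat" → 0.5
```

Sparsity shows up immediately: any pair never seen in the corpus gets probability 0, which is why smoothing is needed in practice.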


Key idea: language is extremely sparse

Conflicting demands: more context is better, but more context brings worse sparsity problems.

Neural LM

Same task; we want a model for

\begin{equation} W^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}

We use the same Markov assumption, and then use a fixed local window of text to try to predict the next word.

Naively learning from concatenated word vectors is bad, because the weights are not position agnostic: each slot in the context window gets its own weights, so what the model learns about a word is tied to the exact position where that word appeared.

Recurrent Neural Network

Key RNN idea: apply the same weights \(W\) repeatedly. Each hidden state is the previous hidden state multiplied by one matrix, plus the new word's embedding multiplied by another, passed through a nonlinearity. So each step:

\begin{equation} h_{t} = \sigma \qty( W_{h}h_{t-1} + W_{e} e_{t} + b) \end{equation}
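This step can be sketched in NumPy with \(\sigma\) as the logistic sigmoid; the dimensions and random weights below are placeholder choices:

```python
import numpy as np

def rnn_step(h_prev, e_t, W_h, W_e, b):
    """One RNN step: h_t = sigmoid(W_h h_{t-1} + W_e e_t + b)."""
    z = W_h @ h_prev + W_e @ e_t + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_h, d_e = 4, 3  # hidden and embedding dimensions (arbitrary)
W_h = rng.normal(size=(d_h, d_h))
W_e = rng.normal(size=(d_h, d_e))
b = np.zeros(d_h)

h = np.zeros(d_h)
for e_t in rng.normal(size=(5, d_e)):   # a sequence of 5 word embeddings
    h = rnn_step(h, e_t, W_h, W_e, b)   # the same W_h, W_e are reused at every step
print(h.shape)  # (4,)
```

Note that the hidden state size stays fixed no matter how long the sequence is, which is exactly what makes the weights length agnostic.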

Our LM training uses “teacher forcing”: at each step, we feed in the ground-truth previous word as input, rather than the model’s own prediction.

Usually we define our loss by multiplying the summed per-token loss by \(\frac{1}{T}\), where \(T\) is the token count per document.
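Concretely, the averaged loss is the mean negative log-likelihood over the sequence; the per-token probabilities below are made up for illustration:

```python
import math

def avg_nll(token_probs):
    """Average negative log-likelihood: (1/T) * sum_t -log p(w_t | context)."""
    T = len(token_probs)
    return sum(-math.log(p) for p in token_probs) / T

# Hypothetical probabilities the model assigned to each correct next word:
print(avg_nll([0.5, 0.25, 0.125]))  # ≈ 1.386
```

Dividing by \(T\) keeps the loss comparable across documents of different lengths.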


Drawbacks:

  • slow (computation is sequential across time steps)
  • hard to look far into the past (gradients vanish over long distances)


“gradient w.r.t. a repeated weight is the sum of the gradient for each time where it appears”

Recall that gradients sum at branch points in the computation graph, so we calculate the gradient contribution at each step and add them up.

Sometimes you would give up after backpropagating \(n\) steps (not all the way into history); this is truncated backpropagation through time.

\begin{equation} \pdv{J^{(t)}}{W_{h}} = \sum_{i=1}^{t} \eval{\pdv{J^{(t)}}{W_{h}}}_{(i)} \end{equation}
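The sum-over-timesteps rule can be checked numerically on a toy scalar “RNN” (linear, with the sigmoid dropped for clarity; all values here are made up):

```python
def forward(w, xs, h0=1.0):
    """Scalar linear recurrence h_t = w * h_{t-1} + x_t; returns all hidden states."""
    h, hs = h0, [h0]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs

w, xs = 0.5, [1.0, 1.0]
hs = forward(w, xs)  # [h_0, h_1, h_2]

# Gradient of J = h_2 w.r.t. the shared weight w, as a sum of per-use contributions:
g_step2 = hs[1]      # contribution from the use of w at step 2: dh_2/dw|_(2) = h_1
g_step1 = w * hs[0]  # contribution from step 1 (dh_1/dw|_(1) = h_0), chained through dh_2/dh_1 = w
total = g_step1 + g_step2

# Compare against a central finite difference of the full unrolled computation:
eps = 1e-6
fd = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(total, fd)  # both ≈ 2.0
```

The two numbers agree, confirming that summing the gradient over each place the repeated weight appears recovers the true gradient.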