An LM predicts this distribution (“what’s the distribution of the next word, given the previous words?”):
\begin{equation} w^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}
By applying the chain rule, we can also think of the language model as assigning a probability to a sequence of words:
\begin{align} P(S) &= P(w^{(t)} | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \cdot P(w^{(t-1)} | w^{(t-2)}, \dots, w^{(1)}) \cdots P(w^{(1)}) \\ &= P(w^{(t)}, w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{align}
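As a concrete illustration, here is a minimal sketch of scoring a sequence via the chain rule by summing conditional log-probabilities; `cond_logprob` is a hypothetical callable standing in for whatever model supplies \(P(w^{(t)} | w^{(t-1)}, \dots, w^{(1)})\):

```python
import math

def sequence_logprob(words, cond_logprob):
    """Chain rule: log P(S) = sum_t log P(w^(t) | w^(1) .. w^(t-1))."""
    total = 0.0
    for t, word in enumerate(words):
        context = words[:t]              # all previous words w^(1) .. w^(t-1)
        total += cond_logprob(word, context)
    return total

# toy usage: a uniform model over a 100-word vocabulary
toy = lambda word, context: -math.log(100)
print(sequence_logprob(["the", "cat", "sat"], toy))   # = 3 * log(1/100)
```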
N-Gram LM
We can use the Markov assumption: that a word depends only on the previous \(n-1\) words.
By doing this, we can write:
\begin{align} P(x_{t} | x_{t-1}, \dots, x_{0}) &\approx P(x_{t} | x_{t-1}, \dots, x_{t-n+1}) \\ &= \frac{P(x_{t}, x_{t-1}, \dots, x_{t-n+1})}{P(x_{t-1}, \dots, x_{t-n+1})} \end{align}
and we can then estimate these probabilities directly from n-gram counts in a corpus.
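A minimal count-based sketch, assuming the corpus is already tokenized into a list of words (function and variable names here are illustrative):

```python
from collections import Counter

def train_ngram(corpus, n):
    """Count n-grams and their (n-1)-gram contexts from a tokenized corpus."""
    ngram_counts, context_counts = Counter(), Counter()
    for i in range(len(corpus) - n + 1):
        ngram = tuple(corpus[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def prob(word, context, ngram_counts, context_counts):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0                       # unseen context: the sparsity problem below
    return ngram_counts[context + (word,)] / context_counts[context]

# usage
corpus = "the cat sat on the mat the cat ate".split()
ngrams, contexts = train_ngram(corpus, n=2)
print(prob("cat", ["the"], ngrams, contexts))   # count("the cat") / count("the") = 2/3
```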
Problems
Key issue: language data is extremely sparse
- any OOV word (or any unseen n-gram) gets probability zero; Laplace smoothing partially mitigates this, but not fully.
- any given context (the \(n-1\) conditioning words) may never appear in the corpus at all; stupid backoff partially mitigates this (both fixes are sketched below).
Conflicting demands: more context gives better predictions, but more context makes the sparsity problem worse.
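Hedged sketches of the two fixes just mentioned. The add-one form of Laplace smoothing and the 0.4 backoff discount follow the usual conventions; the all-orders count dictionary and the function names are my own illustration, not something defined above:

```python
from collections import Counter

def count_all_orders(corpus, max_n):
    """Count every n-gram of order 1..max_n in one Counter keyed by word tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[tuple(corpus[i:i + n])] += 1
    return counts

def laplace_prob(word, context, counts, vocab_size):
    """Add-one (Laplace) smoothing: no n-gram gets probability exactly zero."""
    context = tuple(context)
    return (counts[context + (word,)] + 1) / (counts[context] + vocab_size)

def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    """Stupid backoff: if the n-gram was never seen, drop the oldest context word
    and retry, discounting by alpha (these are scores, not normalized probabilities)."""
    context = tuple(context)
    if counts[context + (word,)] > 0:
        denom = counts[context] if context else total_tokens
        return counts[context + (word,)] / denom
    if not context:
        return counts[(word,)] / total_tokens   # unigram fallback (0 if truly OOV)
    return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)
```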
Neural LM
Same task; we want a model for
\begin{equation} w^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}
We use the same Markov assumption, and then use a fixed local window of text to try to predict the next word.
Naively learning from concatenated word vectors is bad, because the learned weights are tied to a fixed window length and to specific positions within it: each slot in the window gets its own block of weights, so the model is very sensitive to exactly where in the context a word occurs, and nothing it learns is shared across positions.
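A minimal NumPy sketch of such a fixed-window neural LM, to make the problem concrete. All dimensions, the tanh/softmax choices, and the variable names are illustrative assumptions; note that `W` acts on the concatenation, so each window position gets its own slice of weights, which is exactly the position sensitivity described above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10_000, 64, 4, 128     # vocab size, embedding dim, context size, hidden dim

E  = rng.normal(scale=0.01, size=(V, d))                 # embedding matrix
W  = rng.normal(scale=0.01, size=(hidden, window * d))   # weights over the concatenated window
b1 = np.zeros(hidden)
U  = rng.normal(scale=0.01, size=(V, hidden))            # output projection to the vocabulary
b2 = np.zeros(V)

def predict_next(context_ids):
    """context_ids: the last `window` word ids. Returns P(. | context)."""
    x = np.concatenate([E[i] for i in context_ids])      # (window * d,) concatenated embeddings
    h = np.tanh(W @ x + b1)                               # hidden layer
    logits = U @ h + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                                # softmax over the vocabulary

p = predict_next([17, 4, 256, 42])                        # four arbitrary example word ids
```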
Recurrent Neural Network
Key RNN idea: apply the same weights \(W\) repeatedly. Each hidden state is the previous hidden state multiplied by one matrix, plus the embedding of the new word multiplied by another matrix, passed through a nonlinearity. So at each step:
\begin{equation} h_{t} = \sigma \qty( W_{h}h_{t-1} + W_{e} e_{t} + b) \end{equation}
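A hedged NumPy sketch of that recurrence. Dimensions are illustrative, and \(\sigma\) is taken to be tanh here (a common choice; the formula above only says "some nonlinearity"):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 10_000, 64, 128

E   = rng.normal(scale=0.01, size=(V, d))         # word embeddings
W_h = rng.normal(scale=0.01, size=(hidden, hidden))
W_e = rng.normal(scale=0.01, size=(hidden, d))
b   = np.zeros(hidden)

def run_rnn(word_ids):
    """Apply the same W_h, W_e at every step: h_t = sigma(W_h h_{t-1} + W_e e_t + b)."""
    h = np.zeros(hidden)
    states = []
    for w in word_ids:
        h = np.tanh(W_h @ h + W_e @ E[w] + b)
        states.append(h)
    return states             # one hidden state per input word

hs = run_rnn([17, 4, 256])    # example word ids
```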
Our LM training input uses “teacher forcing”: at each step we feed in the ground-truth previous word as input (rather than the model’s own generated prediction), and the model predicts each next word.
Usually we define our loss by averaging over tokens, i.e. multiplying the summed per-token loss by \(\frac{1}{T}\), where \(T\) is the number of tokens in the document.
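Written out, this is the usual averaged cross-entropy (the standard convention rather than something stated above; \(\hat{y}^{(t)}\) is my notation for the model’s predicted distribution over the vocabulary at step \(t\)):
\begin{equation} J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta), \qquad J^{(t)}(\theta) = -\log \hat{y}^{(t)}_{w^{(t)}} \end{equation}
where \(\hat{y}^{(t)}_{w^{(t)}}\) is the probability the model assigned to the word that actually occurred at position \(t\).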
Problems
- slow: the recurrence is sequential, so computation cannot be parallelized across time steps
- hard to use information from far in the past (gradients vanish over long distances)
Gradient
“gradient w.r.t. a repeated weight is the sum of the gradient for each time where it appears”
Recall that gradients sum at branch points in the computation graph, so we compute the gradient contribution at each time step and add them up.
Sometimes you give up after backpropagating \(n\) steps rather than going all the way back through history (truncated backpropagation through time).
\begin{equation} \pdv{J^{(t)}}{W_{h}} = \sum_{i=1}^{t} \left. \pdv{J^{(t)}}{W_{h}} \right|_{(i)} \end{equation}
where \((i)\) marks the gradient taken through the \(i\)-th application of \(W_{h}\).
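A hedged NumPy sketch of accumulating that sum with truncated backpropagation through time. It assumes the tanh recurrence from the RNN sketch above, a single loss placed on the final hidden state, and a caller-supplied gradient `dJ_dhT`; each loop iteration adds one term of the sum in the equation:

```python
import numpy as np

def bptt_grad_Wh(states, h0, W_h, dJ_dhT, n_steps):
    """Accumulate dJ/dW_h by summing the contribution from each time step where
    W_h was applied, backpropagating at most `n_steps` into the past.

    states: [h_1, ..., h_T] from the forward pass (tanh recurrence),
    h0: initial hidden state, dJ_dhT: gradient of the loss w.r.t. h_T.
    """
    grad_Wh = np.zeros_like(W_h)
    delta = dJ_dhT                                   # dJ/dh_t, starting at t = T
    T = len(states)
    for t in range(T, max(T - n_steps, 0), -1):      # walk backwards in time
        h_t = states[t - 1]
        h_prev = states[t - 2] if t >= 2 else h0
        da = delta * (1.0 - h_t ** 2)                # through tanh: dJ/da_t
        grad_Wh += np.outer(da, h_prev)              # contribution of step t to dJ/dW_h
        delta = W_h.T @ da                           # pass the gradient to h_{t-1}
    return grad_Wh

# usage sketch (reusing W_h and run_rnn from the RNN sketch above, hidden = 128):
# states = run_rnn([17, 4, 256, 42])
# grad = bptt_grad_Wh(states, np.zeros(128), W_h, dJ_dhT=np.ones(128), n_steps=2)
```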