An LM predicts this distribution (“what’s the distribution of the next word, given the previous words?”):
\begin{equation} w^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}
By applying the chain rule, we can also think of the language model as assigning a probability to a sequence of words:
\begin{align} P(S) &= P(w^{(t)} | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \cdot P(w^{(t-1)} | w^{(t-2)}, \dots, w^{(1)}) \cdots P(w^{(1)}) \\ &= P(w^{(t)}, w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{align}
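As a concrete illustration, here is a minimal sketch of scoring a sequence via the chain rule by summing conditional log-probabilities; `cond_logprob` is a hypothetical callable standing in for whatever model supplies \(P(w^{(t)} | w^{(t-1)}, \dots, w^{(1)})\):

```python
import math

def sequence_logprob(words, cond_logprob):
    """Chain rule: log P(S) = sum_t log P(w^(t) | w^(1) .. w^(t-1))."""
    total = 0.0
    for t, word in enumerate(words):
        context = words[:t]              # all previous words w^(1) .. w^(t-1)
        total += cond_logprob(word, context)
    return total

# toy usage: a uniform model over a 100-word vocabulary
toy = lambda word, context: -math.log(100)
print(sequence_logprob(["the", "cat", "sat"], toy))   # = 3 * log(1/100)
```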
N-Gram LM
We can use the Markov assumption: that a word depends only on the previous \(n-1\) words.
By doing this, we can write:
\begin{align} P(x_{t} | x_{t-1}, \dots, x_{0}) &\approx P(x_{t} | x_{t-1}, \dots, x_{t-n+1}) \\ &= \frac{P(x_{t}, x_{t-1}, \dots, x_{t-n+1})}{P(x_{t-1}, \dots, x_{t-n+1})} \end{align}
and we can then estimate these probabilities directly from n-gram counts in a corpus.
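A minimal count-based sketch, assuming the corpus is already tokenized into a list of words (function and variable names here are illustrative):

```python
from collections import Counter

def train_ngram(corpus, n):
    """Count n-grams and their (n-1)-gram contexts from a tokenized corpus."""
    ngram_counts, context_counts = Counter(), Counter()
    for i in range(len(corpus) - n + 1):
        ngram = tuple(corpus[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def prob(word, context, ngram_counts, context_counts):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0                       # unseen context: the sparsity problem below
    return ngram_counts[context + (word,)] / context_counts[context]

# usage
corpus = "the cat sat on the mat the cat ate".split()
ngrams, contexts = train_ngram(corpus, n=2)
print(prob("cat", ["the"], ngrams, contexts))   # count("the cat") / count("the") = 2/3
```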
Problems
Key issue: language data is extremely sparse
- any OOV word (or any unseen n-gram) gets probability zero; Laplace smoothing partially mitigates this, but not fully.
- any given context (the \(n-1\) conditioning words) may never appear in the corpus at all; stupid backoff partially mitigates this (both fixes are sketched below).
Conflicting demands: more context gives better predictions, but more context makes the sparsity problem worse.
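Hedged sketches of the two fixes just mentioned. The add-one form of Laplace smoothing and the 0.4 backoff discount follow the usual conventions; the all-orders count dictionary and the function names are my own illustration, not something defined above:

```python
from collections import Counter

def count_all_orders(corpus, max_n):
    """Count every n-gram of order 1..max_n in one Counter keyed by word tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[tuple(corpus[i:i + n])] += 1
    return counts

def laplace_prob(word, context, counts, vocab_size):
    """Add-one (Laplace) smoothing: no n-gram gets probability exactly zero."""
    context = tuple(context)
    return (counts[context + (word,)] + 1) / (counts[context] + vocab_size)

def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    """Stupid backoff: if the n-gram was never seen, drop the oldest context word
    and retry, discounting by alpha (these are scores, not normalized probabilities)."""
    context = tuple(context)
    if counts[context + (word,)] > 0:
        denom = counts[context] if context else total_tokens
        return counts[context + (word,)] / denom
    if not context:
        return counts[(word,)] / total_tokens   # unigram fallback (0 if truly OOV)
    return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)
```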
Neural LM
Same task; we want a model for
\begin{equation} w^{(t)} \sim P(\cdot | w^{(t-1)}, w^{(t-2)}, \dots, w^{(1)}) \end{equation}
We use the same Markov assumption, and then use a fixed local window of text to try to predict the next word.
Naively learning from concatenated word vectors is bad, because the learned weights are tied to a fixed window length and to specific positions within it: each slot in the window gets its own block of weights, so the model is very sensitive to exactly where in the context a word occurs, and nothing it learns is shared across positions.
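A minimal NumPy sketch of such a fixed-window neural LM, to make the problem concrete. All dimensions, the tanh/softmax choices, and the variable names are illustrative assumptions; note that `W` acts on the concatenation, so each window position gets its own slice of weights, which is exactly the position sensitivity described above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10_000, 64, 4, 128     # vocab size, embedding dim, context size, hidden dim

E  = rng.normal(scale=0.01, size=(V, d))                 # embedding matrix
W  = rng.normal(scale=0.01, size=(hidden, window * d))   # weights over the concatenated window
b1 = np.zeros(hidden)
U  = rng.normal(scale=0.01, size=(V, hidden))            # output projection to the vocabulary
b2 = np.zeros(V)

def predict_next(context_ids):
    """context_ids: the last `window` word ids. Returns P(. | context)."""
    x = np.concatenate([E[i] for i in context_ids])      # (window * d,) concatenated embeddings
    h = np.tanh(W @ x + b1)                               # hidden layer
    logits = U @ h + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                                # softmax over the vocabulary

p = predict_next([17, 4, 256, 42])                        # four arbitrary example word ids
```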
Recurrent Neural Network
Key RNN idea: apply the same weights \(W\) repeatedly. Each hidden state is the previous hidden state multiplied by one matrix, plus the embedding of the new word multiplied by another matrix, passed through a nonlinearity. So at each step:
\begin{equation} h_{t} = \sigma \qty( W_{h}h_{t-1} + W_{e} e_{t} + b) \end{equation}
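A hedged NumPy sketch of that recurrence. Dimensions are illustrative, and \(\sigma\) is taken to be tanh here (a common choice; the formula above only says "some nonlinearity"):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 10_000, 64, 128

E   = rng.normal(scale=0.01, size=(V, d))         # word embeddings
W_h = rng.normal(scale=0.01, size=(hidden, hidden))
W_e = rng.normal(scale=0.01, size=(hidden, d))
b   = np.zeros(hidden)

def run_rnn(word_ids):
    """Apply the same W_h, W_e at every step: h_t = sigma(W_h h_{t-1} + W_e e_t + b)."""
    h = np.zeros(hidden)
    states = []
    for w in word_ids:
        h = np.tanh(W_h @ h + W_e @ E[w] + b)
        states.append(h)
    return states             # one hidden state per input word

hs = run_rnn([17, 4, 256])    # example word ids
```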
Our LM training input uses “teacher forcing”: at each step we feed in the ground-truth previous word as input (rather than the model’s own generated prediction), and the model predicts each next word.
Usually we define our loss by averaging over tokens, i.e. multiplying the summed per-token loss by \(\frac{1}{T}\), where \(T\) is the number of tokens in the document.
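Written out, this is the usual averaged cross-entropy (the standard convention rather than something stated above; \(\hat{y}^{(t)}\) is my notation for the model’s predicted distribution over the vocabulary at step \(t\)):
\begin{equation} J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta), \qquad J^{(t)}(\theta) = -\log \hat{y}^{(t)}_{w^{(t)}} \end{equation}
where \(\hat{y}^{(t)}_{w^{(t)}}\) is the probability the model assigned to the word that actually occurred at position \(t\).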
Problems
- slow: the recurrence is sequential, so computation cannot be parallelized across time steps
- hard to use information from far in the past (gradients vanish over long distances)
Gradient
“gradient w.r.t. a repeated weight is the sum of the gradient for each time where it appears”
Recall that gradients sum at branch points in the computation graph, so we compute the gradient contribution at each time step and add them up.
Sometimes you give up after backpropagating \(n\) steps rather than going all the way back through history (truncated backpropagation through time).
\begin{equation} \pdv{J^{(t)}}{W_{h}} = \sum_{i=1}^{t} \left. \pdv{J^{(t)}}{W_{h}} \right|_{(i)} \end{equation}
where \((i)\) marks the gradient taken through the \(i\)-th application of \(W_{h}\).
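A hedged NumPy sketch of accumulating that sum with truncated backpropagation through time. It assumes the tanh recurrence from the RNN sketch above, a single loss placed on the final hidden state, and a caller-supplied gradient `dJ_dhT`; each loop iteration adds one term of the sum in the equation:

```python
import numpy as np

def bptt_grad_Wh(states, h0, W_h, dJ_dhT, n_steps):
    """Accumulate dJ/dW_h by summing the contribution from each time step where
    W_h was applied, backpropagating at most `n_steps` into the past.

    states: [h_1, ..., h_T] from the forward pass (tanh recurrence),
    h0: initial hidden state, dJ_dhT: gradient of the loss w.r.t. h_T.
    """
    grad_Wh = np.zeros_like(W_h)
    delta = dJ_dhT                                   # dJ/dh_t, starting at t = T
    T = len(states)
    for t in range(T, max(T - n_steps, 0), -1):      # walk backwards in time
        h_t = states[t - 1]
        h_prev = states[t - 2] if t >= 2 else h0
        da = delta * (1.0 - h_t ** 2)                # through tanh: dJ/da_t
        grad_Wh += np.outer(da, h_prev)              # contribution of step t to dJ/dW_h
        delta = W_h.T @ da                           # pass the gradient to h_{t-1}
    return grad_Wh

# usage sketch (reusing W_h and run_rnn from the RNN sketch above, hidden = 128):
# states = run_rnn([17, 4, 256, 42])
# grad = bptt_grad_Wh(states, np.zeros(128), W_h, dJ_dhT=np.ones(128), n_steps=2)
```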