Main goals: assign a probability of each sequence of words existing:

\begin{equation} P(W) = P(w_1, \dots, w_{n}) \end{equation}

closely related is the NLG formulation of predicting an upcoming word:

\begin{equation} P(w_5|w_1, \dots, w_{n}) \end{equation}

either of these we call a “grammar”, or “Language Model”.

## Chain Rule Language Modeling

Recall probability chain rule. Now, the probability of a sequence like:

\begin{equation} P(its\ water\ is\ so\ transparent) \end{equation}

gives:

\begin{equation} P(its) \times P(water|its) \times P(is | its\ water) \dots \end{equation}

That is:

\begin{equation} P(w_1 \dots w_{n}) = \prod_{i}P(w_{i} | w_1 \dots w_{i-1}) \end{equation}

## Markov Assumption

Because we can’t make conditional counts over all words all the time, we make an assumption: the probability of the current word is the probability of the current word conditioned on the probability of the last \(k\) words.

\begin{equation} P(w_1, \dots, w_{n}) \approx \prod_{i}P(w_{i} | w_{i-k} \dots w_{i-1}) \end{equation}

## Unigrams

The simplest Markov Assumption is unigrams, which will be word salad generation because it has no understanding of language structure.

### Naive Bays Language Modeling

You can consider each class in Naive Bayes \(P(word | c)\) as a language model.

So:

\begin{equation} P(sentence|c) = \prox_{i}P(word_{i}|c) \end{equation}

Each class is a separate class-conditioned language model. So, we just want to compute the probability of each sentence, and classify the sentence based on the higher probability result.

## Limitations

In general, n gram models are limited because they don’t consider long distance dependencies which are present in English.

## Estimating N-Grams

Many counts are results of…

- world (“people want chinese food more often, so want+Chinese appears more”)
- grammar (“want+want is less likely”)

### MLE

\begin{equation} P(w_{i} | w_{i-1}) = \frac{C(w_{i-1}, w_{i})}{C(w_{i-1})} \end{equation}

### MAP, i.e. Laplace Smoothing

\begin{equation} P(w_{i} | w_{i-1}) = \frac{C(w_{i-1}, w_{i})+1}{C(w_{i-1})+V} \end{equation}

we have to add \(V\) on the denominator because every word could possibly follow \(w_{i-1}\). Note that as \(N\) increases we actually still add \(V\) because we are predicting at each time a **single word** (just conditioned on more words), so if we are smoothing output we are only adding \(V\) extra counts.

IMPORTANT NOTE THOUGH: this is typically not used for N-Grams (because there are simply so many OOS sequences). Instead, its more frequently used in other cases such as Naive Bayes for Text Classification.

### Log Probs

In practice, we keep probability as log probabilities after we computed them.

### N-Gram Models

Google n-gram models, SRILM

### Backoff

Use trigrams if high probability evidence is found, otherwise bigrams or unigrams

#### Stupid Backoff

- give the MLE if the conditioning sequence has a non-zero count
- otherwise, start backing off, recursively calculating the probability of the current word given the last n-1-gram, multplied by a discount factor
- if we end up with a unigram, just give the unigram probability

This **DOES NOT PRODUCE A PROBABILITY** as it is not normalized. Instead of being probabilites, we consider them “scores”.

### Interpolation

In practice, Interpolation works better. Interpolation smoothes the probability between unigram, bigram, and trigrams.

Mostly simply, we mix them with some factors \(\lambda_{1}, \lambda_{2}, \lambda_{3}\), where \(\sum_{i} \lambda_{i} = 1\). This makes a weighted average over probabilities:

\begin{equation} P(comb) = \lambda_{1} P(uni) + \lambda_{2} P(bi)+ \lambda_{3} P(tri) \end{equation}

lambdas could also be a function of the previous tokens.

We sometimes obtain this with a disjoint dataset from the original training set, whereby we train some ngrams from the original dataset, and then identify \(\lambda\) which maximises the probabilities.

### OOV Words

we sometimes replace the lowest likelyhood few words with `<UNK>`

, and train models such that we can have an open vocabulary: whenever we encounter unknown words, we replace it with `<UNK>`

## Scaling Up

Strategies to make LMing with Ngrams more efficient

- pruning: only store ngrams of top k
- use tries (suffix trees, etc.)
- approximations: bloom filter
- storing indicies