Bag of Words is an order-free representation of a corpus: each document is reduced to the counts of its words, discarding word order and any other structural information.
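For example, a minimal sketch in Python (the toy document and whitespace tokenization are made up for illustration):

```python
from collections import Counter

# A toy document; tokenization here is just whitespace splitting.
doc = "the movie was great the acting was great"
bag = Counter(doc.split())
# e.g. Counter({'the': 2, 'was': 2, 'great': 2, 'movie': 1, 'acting': 1})
```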

With Bayes' theorem, the most probable (maximum a posteriori) class is \(\arg\max_{c\in C} P(c|d) = \arg\max_{c\in C} \frac{P(d|c)P(c)}{P(d)}\); since \(P(d)\) is the same for every class, we can drop it:

\begin{equation} C_{MAP} = \arg\max_{c \in C} P(d|c)P(c) \end{equation}

where \(d\) is the document, and \(c\) is the class.

So, representing the document as a set of word features \(x_1, \dots, x_n\):

\begin{equation} C_{MAP} = \arg\max_{c\in C} P(x_1, \dots, x_{n}|c)P(c) \end{equation}

## Naive Bayes for Text Classification

We estimate each of the two factors as follows.

### \(P(c)\)

The prior is just the relative frequency of each class: the number of training documents labeled \(c\) divided by the total number of training documents.
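A minimal sketch of this estimate, assuming the training set is a list of (tokens, label) pairs (the function name is illustrative):

```python
from collections import Counter

def estimate_priors(labeled_docs):
    """P(c) = (# documents labeled c) / (total # documents)."""
    class_counts = Counter(label for _, label in labeled_docs)
    total = sum(class_counts.values())
    return {c: count / total for c, count in class_counts.items()}
```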

### \(P(x_1, \dots, x_{n}|c)\)

- Bag of Words assumption: a word’s position in the document doesn’t matter, only its presence and count.
- Naive Bayes (conditional independence) assumption: the feature probabilities \(P(x_{i}|c)\) are independent given the class \(c\), so the joint likelihood factorizes into a product.

So we have:

\begin{equation} C_{NB} = \arg\max_{c\in C} P(c) \prod_{i=1}^{n} P(x_{i}|c) \end{equation}

In practice we use logs to prevent floating-point underflow from multiplying many small probabilities:

\begin{equation} C_{NB} = \arg\max_{c\in C}\left[ \log P(c) + \sum_{i=1}^{n} \log P(x_{i}|c) \right] \end{equation}
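A minimal sketch of log-space scoring, assuming we already have a `log_priors` dict and per-class `log_likelihoods` dicts (both names are illustrative; estimation is covered in the next section). Unknown words are simply skipped, as discussed below:

```python
import math

def classify(tokens, log_priors, log_likelihoods):
    """Return the class maximizing log P(c) + sum of log P(x|c) over tokens."""
    best_class, best_score = None, -math.inf
    for c, log_prior in log_priors.items():
        score = log_prior
        for token in tokens:
            # Words never seen in training are ignored (see "Unknown Words").
            if token in log_likelihoods[c]:
                score += log_likelihoods[c][token]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```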

### Parameter Estimation

Because a word that never occurs with a class would otherwise drive the whole product to zero, we smooth the estimates with a symmetric Dirichlet prior (add-\(\alpha\), i.e. Laplace, smoothing):

\begin{equation} P(w_{k}|c_{j}) = \frac{n_{k} + \alpha }{n + \alpha |V|} \end{equation}

where \(n_{k}\) is the number of occurrences of word \(w_k\) in documents of class \(c_{j}\), \(n\) is the total number of word tokens in class \(c_{j}\), and \(|V|\) is the vocabulary size.
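A sketch of the smoothed estimate, assuming `docs_by_class` maps each class to a list of tokenized documents (names and structure are illustrative):

```python
import math
from collections import Counter

def estimate_log_likelihoods(docs_by_class, alpha=1.0):
    """log P(w_k | c_j) with add-alpha smoothing over a shared vocabulary."""
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    log_likelihoods = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)  # n_k for each word
        n = sum(counts.values())                          # total word tokens in class c
        denom = n + alpha * len(vocab)
        log_likelihoods[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_likelihoods
```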

### Unknown Words

We ignore them: words in a test document that never appeared in the training vocabulary are simply dropped, because knowing that a class has lots of unknown words doesn’t help discriminate between classes.

### Binary Naive Bayes

There is another version, Binary Naive Bayes, which simply clips every word count \(n_{k}\) to \(1\) for both training and test. You do this by de-duplicating the corpus per **document** (i.e. if a word appears twice in the same document, it is counted only once).
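A sketch of the de-duplication step, assuming each document is a list of tokens:

```python
def binarize(docs):
    """Clip within-document counts to 1 by de-duplicating each document's tokens."""
    return [list(set(doc)) for doc in docs]

# Train and classify on binarize(docs); everything else stays the same.
```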

### Benefits

- Doesn’t have significant **fragmentation** problems (i.e. performance doesn’t collapse when many features all matter to the decision)
- Robust to irrelevant features (their effects tend to cancel each other out)