mutual information
Mutual information is a measure of the dependence of two random variables in information theory. Applications include collocation extraction, which requires finding how strongly two words co-occur (in a true collocation, one word contributes much less entropy once the other is known).
constituents
- \(X, Y\) random variables
- \(D_{KL}\) KL Divergence function
- \(P_{(X,Y)}\) the joint distribution of \(X,Y\)
- \(P_{X}, P_{Y}\) the marginal distributions of \(X,Y\)
requirements
mutual information is defined as
\begin{equation} I(X ; Y) = D_{KL}\left(P_{(X,Y)} \,\|\, P_{X} \otimes P_{Y}\right) \end{equation}
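Expanding the KL divergence against the product of the marginals gives the usual sum form for discrete \(X, Y\):
\begin{equation} I(X ; Y) = \sum_{x}\sum_{y} P_{(X,Y)}(x,y) \log \frac{P_{(X,Y)}(x,y)}{P_{X}(x)\, P_{Y}(y)} \end{equation}
For the collocation-extraction application, the per-pair term \(\log \frac{P_{(X,Y)}(x,y)}{P_{X}(x) P_{Y}(y)}\) is the pointwise mutual information (PMI) of a specific word pair.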
mutually exclusive
probability of “or”
If it's not possible for two events to happen at the same time, they are called mutually exclusive; in that case \(P(E \cap F) = 0\), so \(P(E\ or\ F) = P(E) + P(F)\). For general (possibly overlapping) events:
\begin{equation} P(E\ or\ F) = P(E)+P(F) - P(E \cap F) \end{equation}
This is called the inclusion-exclusion principle, and it is what motivates inclusion-exclusion counting.
General inclusion-exclusion principle
It can look scary, but the pattern is simple: for a union of several events, we alternate between adding and subtracting. Terms built from an odd number of events (like \(P(A)\) or \(P(A \cap B \cap C)\)) are added, and terms built from an even number of events (like \(P(A \cap B)\)) are subtracted.
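For three events, the alternation works out to:
\begin{equation} P(A \cup B \cup C) = P(A)+P(B)+P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \end{equation}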
Myhill-Nerode Theorem
An entire characterization of regular languages: it provides necessary and sufficient conditions for a language to be regular.
Statement: a language \(L\) is regular IFF the number of equivalence classes of \(\equiv_{L}\) is finite
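Here \(\equiv_{L}\) is the usual indistinguishability relation on strings:
\begin{equation} x \equiv_{L} y \iff \forall z \in \Sigma^{*}:\ (xz \in L \iff yz \in L) \end{equation}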
corollary
\(L\) is not regular IFF there are infinitely many equivalence classes of \(\equiv_{L}\), meaning there are infinitely many strings \(w_1, w_2, \dots\) such that \(w_{i} \neq w_{j}\) and the strings are pairwise distinguishable with respect to \(L\): for each pair \(i \neq j\), there is at least one \(z \in \Sigma^{*}\) such that exactly one of \(w_{i}z\) and \(w_{j}z\) is in \(L\).
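A standard illustration of the corollary: for \(L = \{0^{n}1^{n} : n \geq 0\}\), the strings \(0^{1}, 0^{2}, 0^{3}, \dots\) are pairwise distinguishable, since for \(i \neq j\) the suffix \(z = 1^{i}\) gives \(0^{i}z \in L\) but \(0^{j}z \notin L\). Hence there are infinitely many equivalence classes, and \(L\) is not regular.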
N-Grams
Main goal: assign a probability to each sequence of words:
\begin{equation} P(W) = P(w_1, \dots, w_{n}) \end{equation}
closely related is the NLG formulation of predicting an upcoming word:
\begin{equation} P(w_5|w_1, w_2, w_3, w_4) \end{equation}
A model that computes either of these is called a “grammar” or a “Language Model”.
Chain Rule Language Modeling
Recall the probability chain rule. Applying it, the probability of a sequence like:
\begin{equation} P(its\ water\ is\ so\ transparent) \end{equation}
decomposes as:
\begin{equation} P(its) \times P(water|its) \times P(is | its\ water) \dots \end{equation}
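As a rough sketch (not from the original note), here is how these conditional probabilities can be estimated with maximum-likelihood bigram counts on a toy corpus; the corpus, function names, and the `<s>` start marker are all illustrative assumptions.

```python
from collections import defaultdict

# Toy corpus; purely illustrative.
corpus = [
    ["its", "water", "is", "so", "transparent"],
    ["its", "water", "is", "so", "clear"],
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence  # sentence-start marker
    for prev, word in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, word)] += 1

def bigram_prob(prev, word):
    """MLE estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Chain rule with a bigram (Markov) approximation:
    P(w_1, ..., w_n) ~ prod_i P(w_i | w_{i-1})."""
    p = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["its", "water", "is", "so", "transparent"]))  # 0.5 on this toy corpus
```

In practice the full chain-rule conditioning is truncated to the last \(n-1\) words (the n-gram / Markov assumption), which is exactly what the bigram approximation above does.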