
mutual information

Last edited: August 8, 2025

mutual information is a measure of the mutual dependence between two random variables in information theory. Applications include collocation extraction, which requires measuring how strongly two words co-occur (meaning one word contributes much less additional entropy once the other is known).

constituents

requirements

mutual information is defined as

\begin{equation} I(X ; Y) = D_{KL}\left(P_{(X, Y)} \,\|\, P_{X} \otimes P_{Y}\right) \end{equation}
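As a quick sanity check (my own sketch, not part of the original note), here is a minimal computation of \(I(X;Y)\) straight from the definition, using a small joint distribution table; the dict-of-pairs representation and the name `mutual_information` are assumptions made for illustration.

```python
# Minimal sketch: I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
import math

def mutual_information(joint):
    """joint: dict mapping (x, y) -> P(X=x, Y=y); probabilities should sum to 1."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# Example: perfectly correlated binary variables carry 1 bit of mutual information.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0
```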

mutually exclusive

Last edited: August 8, 2025

probability of “or”

If it is not possible for two events to happen at the same time, they are called mutually exclusive. In general, the probability of an "or" of two events is:

\begin{equation} P(E\ or\ F) = P(E)+P(F) - P(E \cap F) \end{equation}

This is called the inclusion-exclusion principle, and it is what motivates inclusion-exclusion counting. When \(E\) and \(F\) are mutually exclusive, \(P(E \cap F) = 0\), so the formula reduces to \(P(E\ or\ F) = P(E) + P(F)\).
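For a quick worked example (mine, not from the note): rolling a single die, the events "roll a 1" and "roll a 2" cannot happen together, so

\begin{equation} P(1\ or\ 2) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3} \end{equation}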

General inclusion-exclusion principle

It's scary. Think about the three-event case:

\begin{equation} P(A \cup B \cup C) = P(A)+P(B)+P(C)-P(A \cap B)-P(A \cap C)-P(B \cap C)+P(A \cap B \cap C) \end{equation}

We basically need to alternate between adding and subtracting: we add the odd-sized intersections (the single events and the triple intersection) and subtract the even-sized ones (the pairwise intersections).
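A small numeric check (my addition; the sample space and events are made up for illustration): compute both sides of the three-event formula on a finite, uniform sample space and confirm they agree.

```python
# Verify inclusion-exclusion for three events by direct counting.
from itertools import combinations

omega = set(range(12))                # uniform sample space {0, ..., 11}
A = {w for w in omega if w % 2 == 0}  # even
B = {w for w in omega if w % 3 == 0}  # multiple of 3
C = {w for w in omega if w < 4}       # small

def prob(event):
    return len(event) / len(omega)

events = [A, B, C]
# Alternate signs: add odd-sized intersections, subtract even-sized ones.
ie = sum(
    (-1) ** (k + 1)
    * sum(prob(set.intersection(*combo)) for combo in combinations(events, k))
    for k in range(1, len(events) + 1)
)

print(ie, prob(A | B | C))  # both print 0.75
```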

My Day

Last edited: August 8, 2025

Myhill-Nerode Theorem

Last edited: August 8, 2025

a complete characterization of regular languages: it gives necessary and sufficient conditions for a language to be regular

Statement: a language \(L\) is regular IFF the number of equivalence classes of \(\equiv_{L}\) is finite, where \(x \equiv_{L} y\) means that for every \(z \in \Sigma^{*}\), \(xz \in L\) if and only if \(yz \in L\).

corollary

\(L\) is not regular IFF there are infinitely many equivalence classes of \(\equiv_{L}\); that is, there are infinitely many strings \(w_1, w_2, \dots\) with \(w_{i} \neq w_{j}\) that are pairwise distinguishable with respect to \(L\): for each pair there is at least one \(z \in \Sigma^{*}\) such that exactly one of \(w_{i}z\) and \(w_{j}z\) is in \(L\).
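To make the corollary concrete, here is a small illustration (my own example, not from the note): for \(L = \{0^{n}1^{n}\}\) the prefixes \(0^{1}, 0^{2}, 0^{3}, \dots\) are pairwise distinguishable, since \(z = 1^{i}\) completes \(0^{i}\) into \(L\) but not \(0^{j}\) for \(j \neq i\); so there are infinitely many equivalence classes and \(L\) is not regular. The brute-force search below is purely illustrative.

```python
# Illustration: prefixes 0, 00, 000, ... are pairwise distinguishable for L = { 0^n 1^n }.
from itertools import product

def in_L(s):
    """Membership test for L = { 0^n 1^n : n >= 0 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

def distinguishing_extension(w1, w2, alphabet="01", max_len=6):
    """Brute-force search for a z such that exactly one of w1+z, w2+z is in L."""
    for k in range(max_len + 1):
        for z in ("".join(p) for p in product(alphabet, repeat=k)):
            if in_L(w1 + z) != in_L(w2 + z):
                return z
    return None

# Each pair of distinct prefixes 0^i, 0^j has a distinguishing extension (namely 1^i),
# so no two of them can share an equivalence class of the relation above.
for i, j in [(1, 2), (2, 3), (1, 4)]:
    print(f"0^{i} vs 0^{j}:", repr(distinguishing_extension("0" * i, "0" * j)))
```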

N-Grams

Last edited: August 8, 2025

Main goal: assign a probability to each sequence of words:

\begin{equation} P(W) = P(w_1, \dots, w_{n}) \end{equation}

closely related is the NLG formulation of predicting an upcoming word:

\begin{equation} P(w_5|w_1, w_2, w_3, w_4) \end{equation}

A model that computes either of these is called a “grammar” or a “language model”.

Chain Rule Language Modeling

Recall the probability chain rule. The probability of a sequence like:

\begin{equation} P(its\ water\ is\ so\ transparent) \end{equation}

expands to:

\begin{equation} P(its) \times P(water|its) \times P(is | its\ water) \dots \end{equation}
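As a toy illustration (my sketch; the corpus and the bigram/Markov approximation are assumptions, not part of the note), the chain-rule product can be estimated from counts, here with each conditional approximated by a bigram probability rather than by conditioning on the full history.

```python
# Chain-rule sentence probability with bigram (first-order Markov) conditionals.
from collections import Counter

corpus = [
    "<s> its water is so transparent </s>",
    "<s> its water is so clear </s>",
]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigrams[(a, b)] / unigrams[a]  # maximum-likelihood estimate of P(b | a)
    return p

print(sentence_prob("its water is so transparent"))  # 0.5 with this toy corpus
```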