mutual information
Mutual information is a measure of the dependence of two random variables in information theory. Applications include collocation extraction, which requires finding how strongly two words co-occur (in a true collocation, one word contributes much less entropy once the other is known).
constituents
- \(X, Y\) random variables
- \(D_{KL}\) KL Divergence function
- \(P_{(X,Y)}\) the joint distribution of \(X,Y\)
- \(P_{X}, P_{Y}\) the marginal distributions of \(X,Y\)
requirements
mutual information is defined as
\begin{equation} I(X ; Y) = D_{KL}\left(P_{(X,Y)} \,\|\, P_{X} \otimes P_{Y}\right) \end{equation}
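Expanding the KL divergence against the product of the marginals gives the usual sum form for discrete \(X, Y\):
\begin{equation} I(X ; Y) = \sum_{x}\sum_{y} P_{(X,Y)}(x,y) \log \frac{P_{(X,Y)}(x,y)}{P_{X}(x)\, P_{Y}(y)} \end{equation}
For the collocation-extraction application, the per-pair term \(\log \frac{P_{(X,Y)}(x,y)}{P_{X}(x) P_{Y}(y)}\) is the pointwise mutual information (PMI) of a specific word pair.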
mutually exclusive
probability of “or”
If it's not possible for two events to happen at the same time, they are called mutually exclusive; in that case \(P(E \cap F) = 0\), so \(P(E\ or\ F) = P(E) + P(F)\). For general (possibly overlapping) events:
\begin{equation} P(E\ or\ F) = P(E)+P(F) - P(E \cap F) \end{equation}
This is called the inclusion-exclusion principle, and it is what motivates inclusion-exclusion counting.
General inclusion-exclusion principle
It can look scary, but the pattern is simple: for a union of several events, we alternate between adding and subtracting. Terms built from an odd number of events (like \(P(A)\) or \(P(A \cap B \cap C)\)) are added, and terms built from an even number of events (like \(P(A \cap B)\)) are subtracted.
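For three events, the alternation works out to:
\begin{equation} P(A \cup B \cup C) = P(A)+P(B)+P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \end{equation}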
Myhill-Nerode Theorem
An entire characterization of regular languages: it provides necessary and sufficient conditions for a language to be regular.
Statement: a language \(L\) is regular IFF the number of equivalence classes of \(\equiv_{L}\) is finite
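Here \(\equiv_{L}\) is the usual indistinguishability relation on strings:
\begin{equation} x \equiv_{L} y \iff \forall z \in \Sigma^{*}:\ (xz \in L \iff yz \in L) \end{equation}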
corollary
\(L\) is not regular IFF there are infinitely many equivalence classes of \(\equiv_{L}\), meaning there are infinitely many strings \(w_1, w_2, \dots\) such that \(w_{i} \neq w_{j}\) and the strings are pairwise distinguishable with respect to \(L\): for each pair \(i \neq j\), there is at least one \(z \in \Sigma^{*}\) such that exactly one of \(w_{i}z\) and \(w_{j}z\) is in \(L\).
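A standard illustration of the corollary: for \(L = \{0^{n}1^{n} : n \geq 0\}\), the strings \(0^{1}, 0^{2}, 0^{3}, \dots\) are pairwise distinguishable, since for \(i \neq j\) the suffix \(z = 1^{i}\) gives \(0^{i}z \in L\) but \(0^{j}z \notin L\). Hence there are infinitely many equivalence classes, and \(L\) is not regular.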
N-Grams
Main goal: assign a probability to each sequence of words:
\begin{equation} P(W) = P(w_1, \dots, w_{n}) \end{equation}
closely related is the NLG formulation of predicting an upcoming word:
\begin{equation} P(w_5|w_1, w_2, w_3, w_4) \end{equation}
A model that computes either of these is called a “grammar” or a “Language Model”.
Chain Rule Language Modeling
Recall the probability chain rule. Applying it, the probability of a sequence like:
\begin{equation} P(its\ water\ is\ so\ transparent) \end{equation}
decomposes as:
\begin{equation} P(its) \times P(water|its) \times P(is | its\ water) \dots \end{equation}
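As a rough sketch (not from the original note), here is how these conditional probabilities can be estimated with maximum-likelihood bigram counts on a toy corpus; the corpus, function names, and the `<s>` start marker are all illustrative assumptions.

```python
from collections import defaultdict

# Toy corpus; purely illustrative.
corpus = [
    ["its", "water", "is", "so", "transparent"],
    ["its", "water", "is", "so", "clear"],
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence  # sentence-start marker
    for prev, word in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, word)] += 1

def bigram_prob(prev, word):
    """MLE estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Chain rule with a bigram (Markov) approximation:
    P(w_1, ..., w_n) ~ prod_i P(w_i | w_{i-1})."""
    p = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["its", "water", "is", "so", "transparent"]))  # 0.5 on this toy corpus
```

In practice the full chain-rule conditioning is truncated to the last \(n-1\) words (the n-gram / Markov assumption), which is exactly what the bigram approximation above does.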