corpus

Last edited: August 8, 2025

Usually we use \(N\) to denote the number of tokens, and \(V\) the vocabulary ("vocab"), i.e. the set of word types.

Corpora are usually considered in the context of:

Particularly hard: code switching, gender, demographics, variety, etc.

Herdan’s Law

\begin{equation} |V| = kN^{\beta} \end{equation}

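with \(k\) a positive constant and \(\beta\) a constant typically in the range \(0.67 < \beta < 0.75\).

Because \(\beta < 1\), the vocab size grows sublinearly with the number of tokens (though faster than \(\sqrt{N}\)): it is proportional to \(N^{\beta}\), not to \(N\) itself.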
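As a rough illustration of Herdan's Law, the sketch below counts tokens and types in a text and evaluates \(kN^{\beta}\). The whitespace tokenizer is a simplification, and the values `k=30.0` and `beta=0.7` are hypothetical placeholders rather than fitted constants; in practice both would be estimated from a real corpus.

```python
def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (N, |V|): number of tokens and number of distinct word types.

    Tokenization here is naive lowercase whitespace splitting (an assumption
    made for this sketch; real corpora need a proper tokenizer).
    """
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


def herdan_vocab_size(n_tokens: int, k: float, beta: float) -> float:
    """Predicted vocabulary size |V| = k * N^beta under Herdan's Law."""
    return k * n_tokens ** beta


if __name__ == "__main__":
    n, v = count_tokens_and_types("the cat sat on the mat")
    print(f"N = {n}, |V| = {v}")

    # k and beta below are illustrative values, not measured ones.
    for n_tokens in (10_000, 100_000, 1_000_000):
        predicted = herdan_vocab_size(n_tokens, k=30.0, beta=0.7)
        print(f"N = {n_tokens:>9,}  predicted |V| ~ {predicted:,.0f}")
```

Note that doubling \(N\) multiplies the predicted \(|V|\) by \(2^{\beta}\), which is less than 2 whenever \(\beta < 1\).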


Coulomb’s law is a principle describing the force that two charged particles exert on each other.

\(k\), Coulomb’s constant, found to be roughly \(9 \times 10^{9} \frac{N m^{2}}{C^{2}}\)
\(q_{1}, q_{2}\), the charges of the two particles you are analyzing
\(r\), distance between particles

\begin{equation} F_{E} = k \frac{q_1 q_2}{r^{2}} \end{equation}

negative: attractive force between charges (the points have oppositely signed charges, and so attract)
positive: repulsive force between charges (the points have the same signed charge, and so repel)
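A minimal sketch of the formula and its sign convention, assuming point charges in SI units (coulombs and meters). The function name `coulomb_force` and the example charges are made up for illustration.

```python
# Coulomb's constant in N m^2 / C^2 (roughly 9e9).
K = 8.99e9


def coulomb_force(q1: float, q2: float, r: float) -> float:
    """Signed electric force (newtons) between two point charges.

    q1, q2: charges in coulombs; r: separation in meters.
    A negative result means attraction, a positive result means repulsion.
    """
    if r <= 0:
        raise ValueError("distance r must be positive")
    return K * q1 * q2 / r ** 2


if __name__ == "__main__":
    # Opposite signs: the product q1*q2 is negative, so the charges attract.
    print(coulomb_force(1e-6, -1e-6, 0.1))
    # Same signs: the product q1*q2 is positive, so the charges repel.
    print(coulomb_force(1e-6, 1e-6, 0.1))
```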


“if thing didn’t happen would I have…”


requires manipulating counterfactual information: not what the currently known states are, but what the next possible states are.

Among existing physical laws, there are already a few principles which are counterfactual.

Conservation of energy: a perpetual motion machine is impossible
Second law of thermodynamics: it’s impossible to convert all heat into useful work
Heisenberg’s uncertainty: it’s impossible to reliably copy all states of a qubit

With the impossibles, we can make the possible.