Time Complexity
Time complexity is a function \(t(n): \mathbb{N} \to \mathbb{N}\), and we define:
\begin{equation} \text{TIME}\qty(t(n)) = \qty {\text{languages } L \mid\ \exists \text{TM that decides $L$ in $O(t(n))$ steps} } \end{equation}
notice! the Big-O is baked into the definition. And we write:
\begin{equation} P = \bigcup_{c \in \mathbb{N}} \text{TIME}\qty(n^{c}) \end{equation}
we call problems in this class “efficient” (or “feasible”).
\begin{equation} EXP = \bigcup_{c \in \mathbb{N}} \text{TIME}\qty(2^{n^{c}}) \end{equation}
and adding in non-determinism, we have:
\begin{equation} \text{NTIME}\qty(t(n)) = \qty {\text{languages } L \mid\ \exists \text{NTM that decides $L$ in $O(t(n))$ steps} } \end{equation}
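By analogy with \(P\), the non-deterministic counterpart (the standard definition, added here for completeness) is:
\begin{equation} NP = \bigcup_{c \in \mathbb{N}} \text{NTIME}\qty(n^{c}) \end{equation}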
tokenization
Every NLP task involves some kind of text normalization:
- tokenizing words
- normalizing word formats (lemmatize?)
- sentence and paragraph segmentation
For Latin, Arabic, Cyrillic, and Greek writing systems, spaces can usually be used for tokenization; other writing systems can’t rely on this. See morpheme
Subword Tokenization
Algorithms that use corpus statistics to break text into tokens below the word level.
- BPE
- Unigram Language Modeling tokenization
- WordPiece
They all work in two parts:
- a token learner, which takes a training corpus and derives a vocabulary set
- a token segmenter, which tokenizes text according to that vocabulary
See also Downsides of Subword Tokenization
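Here is a minimal sketch of that learner/segmenter split using BPE-style merges (the function names and toy corpus are made up for illustration; real tokenizers also handle end-of-word markers, pre-tokenization, and frequency ties far more carefully):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Token learner: derive an ordered list of merge rules from a whitespace-split corpus."""
    # represent each word as a tuple of symbols (characters to start)
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the chosen merge to every word in the working vocabulary
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Token segmenter: greedily apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = learn_bpe("low low low lower lowest newer newer wider", num_merges=10)
print(segment("lowest", merges))
```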
top-down parsing
recursive descent parsing
Given the grammar:
E -> T | T + E
T -> int | int * T | (E)
And suppose our token stream is ( int5 ). We start with the top-level non-terminal E, and try to apply the rules for E in order…
(this is exactly how you imagine one builds a parser via recursive backtracking)
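A minimal sketch of that idea in Python (my own helper structure, not from any particular parser generator): each non-terminal becomes a function that yields every position where it can finish, so trying the productions in order and backtracking falls out naturally.

```python
def parse(tokens):
    """Backtracking recursive descent for the grammar
         E -> T | T + E
         T -> int | int * T | ( E )
       Each non-terminal yields every position it can finish at."""

    def term(pos, tok):
        # match a single literal token at position `pos`
        if pos < len(tokens) and tokens[pos] == tok:
            yield pos + 1

    def E(pos):
        # E -> T
        yield from T(pos)
        # E -> T + E
        for p1 in T(pos):
            for p2 in term(p1, "+"):
                yield from E(p2)

    def T(pos):
        # T -> int
        yield from term(pos, "int")
        # T -> int * T
        for p1 in term(pos, "int"):
            for p2 in term(p1, "*"):
                yield from T(p2)
        # T -> ( E )
        for p1 in term(pos, "("):
            for p2 in E(p1):
                yield from term(p2, ")")

    # accept iff some parse of E consumes the whole token stream
    return any(end == len(tokens) for end in E(0))


print(parse(["(", "int", ")"]))                     # True
print(parse(["int", "*", "int", "+", "int"]))       # True
print(parse(["int", "+"]))                          # False
```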
when does it break?
Consider a left-recursive production like S -> S a: a recursive descent parser expanding S immediately tries to expand S again without consuming any input, so it recurses forever. The grammar has to be rewritten to eliminate the left recursion first, e.g. S -> S a | b becomes S -> b S1 with S1 -> a S1 | ε.
Train a Bert
So you want to train a Bert? Either your name is David or you stumbled upon this from the internet. Well, hello. I’m tired and it’s 1 AM, so IDFK if this will be accurate at all. Oh, and if you are reading this because you are an NLPer, I apologize for the notation; it’s 1 AM.
CLS Tokens
A Bert is a bidirectional Transformer encoder model. A Transformer encoder takes a sequence of tokenized input text (each token represented as an embedding) and produces a dense embedding for each token. Hence, for each word vector \(w \in W \subset \mathbb{R}^{n}\), a Bert \(B\) acts as a map in \(\mathcal{L}\qty(W, \mathbb{R}^{m})\), applied to each input token.
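As a concrete sketch using the Hugging Face transformers API (not mentioned above; the checkpoint name is just illustrative), the per-token dense embeddings and the embedding at the [CLS] position look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# illustrative checkpoint; any BERT-style encoder behaves the same way
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("so you want to train a Bert?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state   # shape (1, num_tokens, m): one dense vector per token
cls_embedding = hidden[:, 0, :]      # the vector at the [CLS] position
print(hidden.shape, cls_embedding.shape)
```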
