NLP
Coherence
Generative REVOLUTION
Why probability maximization sucks
It's expensive!
Beam Search
- Take \(k\) candidates
- Expand \(k\) expansions for each of the \(k\) candidates
- Choose the highest probability \(k\) candidates
\(k\) should be small: a large \(k\) pushes us back toward full probability maximization, which is exactly what we're trying to avoid.
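A minimal sketch of this loop in Python. The `next_token_logprobs` function and the tiny vocabulary are made up just to have something runnable; in practice a language model supplies the log-probabilities, and here each candidate is expanded over the whole (tiny) vocabulary before the top \(k\) are kept.

```python
import math

def beam_search(next_token_logprobs, vocab, k=3, max_len=5, eos="<eos>"):
    """Keep the top-k prefixes, expand each one, keep the top k again."""
    beams = [([], 0.0)]  # (token sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:           # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logprobs = next_token_logprobs(seq)
            for tok in vocab:                    # expand each candidate with every possible next token
                candidates.append((seq + [tok], score + logprobs[tok]))
        # choose the highest-probability k candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Made-up "model": always prefers "a", then "b", then stopping.
def toy_logprobs(prefix):
    return {"a": math.log(0.5), "b": math.log(0.3), "<eos>": math.log(0.2)}

for seq, score in beam_search(toy_logprobs, vocab=["a", "b", "<eos>"], k=2, max_len=3):
    print(seq, round(score, 3))
```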
Branch and Bound
See Branch and Bound
Challenges of Direct Sampling
Direct sampling sucks. Just sampling straight from the distribution sucks. The problem is that assigning a slightly lower score ("being less confident") to each token compounds exponentially over the length of the sequence.
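A back-of-the-envelope illustration of that compounding (the per-token probabilities below are made up): a small per-token drop turns into an exponentially large gap over a long sequence.

```python
# Made-up per-token probabilities: "confident" vs. "slightly less confident".
p_confident, p_hedged = 0.95, 0.90

for length in (5, 20, 100):
    # Sequence probabilities multiply, so the gap grows exponentially with length.
    ratio = (p_confident / p_hedged) ** length
    print(f"length {length:>3}: ~{ratio:,.1f}x less likely when every token "
          f"is scored slightly lower")
```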
NLP Index
Learning Goals
- Effective modern methods for deep NLP
- Word vectors, FFNN, recurrent networks, attention
- Transformers, encoder/decoder, pre-training, post-training (RLHF, SFT), adaptation, interpretability, agents
- Big picture in HUMAN LANGUAGES
- why are they hard
- why using computers to deal with them is doubly hard
- Making stuff (in PyTorch)
- word meaning
- dependency parsing
- machine translation
- QA
Lectures
NLP Semantics Timeline
- 1990 static word embeddings
- 2003 neural language models
- 2008 multi-task learning
- 2015 attention
- 2017 transformer
- 2018 trainable contextual word embeddings + large scale pretraining
- 2019 prompt engineering
Motivating Attention
Given a sequence of embeddings: \(x_1, x_2, …, x_{n}\)
For each \(x_{i}\), the goal of attention is to produce a new embedding \(a_{i}\) based on its dot-product similarity with the words that come before it.
Let’s define:
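As a sketch of one standard way to write this down (plain dot-product attention in PyTorch, with no learned query/key/value projections; note that each position here also attends to itself, which keeps the softmax for the first token well-defined):

```python
import torch
import torch.nn.functional as F

def simple_attention(x):
    """x: (n, d) embeddings x_1..x_n. Returns (n, d) new embeddings a_1..a_n."""
    n = x.shape[0]
    scores = x @ x.T                                      # dot-product similarity of every pair
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))    # x_i may only look at x_j with j <= i
    weights = F.softmax(scores, dim=-1)                   # similarities -> attention weights
    return weights @ x                                    # a_i = weighted average of the x_j

x = torch.randn(4, 8)        # 4 tokens, 8-dimensional embeddings
a = simple_attention(x)
print(a.shape)               # torch.Size([4, 8])
```

In a full transformer layer, each \(x_{i}\) would first be projected into separate query, key, and value vectors; the sketch above skips that to keep the core idea visible.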
Noam Chomsky
Non-Deterministic Computation
…building blocks of Non-deterministic Turing Machine. Two transition functions:
\begin{equation} \delta_{0}, \delta_{1} : Q \times \Gamma^{k} \to Q \times \Gamma^{k-1} \times \qty {L, R, S}^{k} \end{equation}
At every step, apply both of these functions and branch on both. Some branches lead to \(q_{\text{accept}}\), and others lead to \(q_{\text{reject}}\).
We accept IFF any path accepts; equivalently, we reject IFF all paths reject.
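A tiny sketch of that acceptance rule in Python (everything here, including the toy machine, is made up for illustration; the real object is a Turing machine, not a Python function):

```python
def run_nondeterministic(step, config, max_depth):
    """Accept iff SOME sequence of branch choices (delta_0 / delta_1) reaches accept."""
    if config == "accept":
        return True
    if config == "reject" or max_depth == 0:
        return False
    # Branch on both transition functions and OR the results together.
    return any(
        run_nondeterministic(step, step(config, bit), max_depth - 1)
        for bit in (0, 1)
    )

# Toy stand-in for a machine: accept once the choice sequence has picked "1" three times.
def toy_step(config, bit):
    count = config + bit
    return "accept" if count >= 3 else count

print(run_nondeterministic(toy_step, 0, max_depth=5))   # True: some branch accepts
print(run_nondeterministic(toy_step, 0, max_depth=2))   # False: every branch fails within 2 steps
```

A deterministic simulation like this one explores the whole branch tree, so its running time is exponential in the depth.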
why NP is awesome
“what a ridiculous model of computation!”