NLP
Coherence
Generative REVOLUTION
Why probability maximization sucks
It's expensive!
Beam Search
- Take \(k\) candidates
- Expand \(k\) expansions for each of the \(k\) candidates
- Choose the highest probability \(k\) candidates
\(k\) should be small: a large \(k\) pushes us back toward full probability maximization, which is exactly what we're trying to avoid.
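A minimal sketch of this loop in Python. The `next_token_logprobs` function and the tiny vocabulary are made up just to have something runnable; in practice a language model supplies the log-probabilities, and here each candidate is expanded over the whole (tiny) vocabulary before the top \(k\) are kept.

```python
import math

def beam_search(next_token_logprobs, vocab, k=3, max_len=5, eos="<eos>"):
    """Keep the top-k prefixes, expand each one, keep the top k again."""
    beams = [([], 0.0)]  # (token sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:           # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logprobs = next_token_logprobs(seq)
            for tok in vocab:                    # expand each candidate with every possible next token
                candidates.append((seq + [tok], score + logprobs[tok]))
        # choose the highest-probability k candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Made-up "model": always prefers "a", then "b", then stopping.
def toy_logprobs(prefix):
    return {"a": math.log(0.5), "b": math.log(0.3), "<eos>": math.log(0.2)}

for seq, score in beam_search(toy_logprobs, vocab=["a", "b", "<eos>"], k=2, max_len=3):
    print(seq, round(score, 3))
```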
Branch and Bound
See Branch and Bound
Challenges of Direct Sampling
Direct sampling sucks. Just sampling straight from the distribution sucks. The problem is that assigning a slightly lower score ("being less confident") to each token compounds exponentially over the length of the sequence.
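A back-of-the-envelope illustration of that compounding (the per-token probabilities below are made up): a small per-token drop turns into an exponentially large gap over a long sequence.

```python
# Made-up per-token probabilities: "confident" vs. "slightly less confident".
p_confident, p_hedged = 0.95, 0.90

for length in (5, 20, 100):
    # Sequence probabilities multiply, so the gap grows exponentially with length.
    ratio = (p_confident / p_hedged) ** length
    print(f"length {length:>3}: ~{ratio:,.1f}x less likely when every token "
          f"is scored slightly lower")
```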
NLP Index
Learning Goals
- Effective modern methods for deep NLP
- Word vectors, FFNN, recurrent networks, attention
- Transformers, encoder/decoder, pre-training, post-training (RLHF, SFT), adaptation, interpretability, agents
- Big picture in HUMAN LANGUAGES
- why are they hard
- why using computers to deal with them is doubly hard
- Making stuff (in PyTorch)
- word meaning
- dependency parsing
- machine translation
- QA
Lectures
NLP Semantics Timeline
- 1990 static word embeddings
- 2003 neural language models
- 2008 multi-task learning
- 2015 attention
- 2017 transformer
- 2018 trainable contextual word embeddings + large scale pretraining
- 2019 prompt engineering
Motivating Attention
Given a sequence of embeddings: \(x_1, x_2, …, x_{n}\)
For each \(x_{i}\), the goal of attention is to produce a new embedding \(a_{i}\) based on its dot-product similarity with the words that come before it.
Let’s define:
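As a sketch of one standard way to write this down (plain dot-product attention in PyTorch, with no learned query/key/value projections; note that each position here also attends to itself, which keeps the softmax for the first token well-defined):

```python
import torch
import torch.nn.functional as F

def simple_attention(x):
    """x: (n, d) embeddings x_1..x_n. Returns (n, d) new embeddings a_1..a_n."""
    n = x.shape[0]
    scores = x @ x.T                                      # dot-product similarity of every pair
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))    # x_i may only look at x_j with j <= i
    weights = F.softmax(scores, dim=-1)                   # similarities -> attention weights
    return weights @ x                                    # a_i = weighted average of the x_j

x = torch.randn(4, 8)        # 4 tokens, 8-dimensional embeddings
a = simple_attention(x)
print(a.shape)               # torch.Size([4, 8])
```

In a full transformer layer, each \(x_{i}\) would first be projected into separate query, key, and value vectors; the sketch above skips that to keep the core idea visible.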
Noam Chomsky
Non-Deterministic Computation
…building blocks of Non-deterministic Turing Machine. Two transition functions:
\begin{equation} \delta_{0}, \delta_{1} : Q \times \Gamma^{k} \to Q \times \Gamma^{k-1} \times \qty {L, R, S}^{k} \end{equation}
At every step, apply both of these functions and branch on both. Some branches lead to \(q_{\text{accept}}\), and others lead to \(q_{\text{reject}}\).
We accept IFF any path accepts; equivalently, we reject IFF all paths reject.
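A tiny sketch of that acceptance rule in Python (everything here, including the toy machine, is made up for illustration; the real object is a Turing machine, not a Python function):

```python
def run_nondeterministic(step, config, max_depth):
    """Accept iff SOME sequence of branch choices (delta_0 / delta_1) reaches accept."""
    if config == "accept":
        return True
    if config == "reject" or max_depth == 0:
        return False
    # Branch on both transition functions and OR the results together.
    return any(
        run_nondeterministic(step, step(config, bit), max_depth - 1)
        for bit in (0, 1)
    )

# Toy stand-in for a machine: accept once the choice sequence has picked "1" three times.
def toy_step(config, bit):
    count = config + bit
    return "accept" if count >= 3 else count

print(run_nondeterministic(toy_step, 0, max_depth=5))   # True: some branch accepts
print(run_nondeterministic(toy_step, 0, max_depth=2))   # False: every branch fails within 2 steps
```

A deterministic simulation like this one explores the whole branch tree, so its running time is exponential in the depth.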
why NP is awesome
“what a ridiculous model of computation!”