Posts

SU-CS224N APR022024

Last edited: August 8, 2025

Why Language

  • language, first, allows communication (which is what allowed humans to take over the world)
  • language allows humans to achieve higher-level thought (it scaffolds detailed planning)
  • language is also a flexible system which allows variably precise communication

“The common misconception is that language use has to do with words and what they mean; instead, language use has to do with people and what they mean.”

Timeline of Development

2014 - Neural Machine Translation

Deep-learning-based Google Translate allows wider communication and understanding across language barriers

SU-CS224N APR042024

Last edited: August 8, 2025

stochastic gradient descent

See stochastic gradient descent

Word2Vec

see word2vec

Or, we can even use a simpler approach: window-based co-occurrence counts, as sketched below.
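
A minimal sketch of building such a window-based co-occurrence matrix; the toy corpus, window size, and variable names are illustrative assumptions, not from the lecture:

import numpy as np

# toy corpus (illustrative assumption)
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1  # symmetric context window size

# build the vocabulary and an index for each word
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# count co-occurrences within the window
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                M[idx[w], idx[words[j]]] += 1

Dimensionality reduction on M (e.g., a truncated SVD) then yields dense word vectors.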

GloVe

  • goal: we want to capture linear meaning components in a word vector space
  • insight: ratios of co-occurrence probabilities encode linear meaning components (for example, P(solid|ice)/P(solid|steam) is large while P(gas|ice)/P(gas|steam) is small, so the ratio isolates the solid vs. gas dimension of meaning)

Therefore, GloVe vectors come from a log-bilinear model:

\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}

such that:

\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
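
A minimal sketch of exploiting this linearity for word analogies, assuming a dict vecs of pre-trained, unit-normalized GloVe vectors (the loading step is omitted; all names here are illustrative):

import numpy as np

# vecs: {word: unit-normalized GloVe vector}, assumed already loaded
def analogy(a, b, x, vecs, topn=1):
    # w_b - w_a + w_x should land near the word completing a:b :: x:?
    target = vecs[b] - vecs[a] + vecs[x]
    target /= np.linalg.norm(target)
    scores = {w: v @ target for w, v in vecs.items() if w not in (a, b, x)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g., analogy("man", "king", "woman", vecs) should return ["queen"]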

Evaluating an NLP System

Intrinsic

  • evaluate on a specific, intermediate subtask rather than the full downstream task
  • fast to compute
  • helps us understand the subsystem in isolation

Extrinsic

  • evaluate on a real task: attempt to replace the older system with the new system and measure the change
  • may be expensive to compute

Word Sense Ambiguity

Each word may have multiple different meanings; each of those separate word senses should live in a different place in the vector space. However, polysemous words have related senses, so in practice a single vector ends up being roughly a frequency-weighted average of the sense vectors:
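
A sketch of that weighted average for a word with three senses (a reconstruction of the missing equation, assuming sense frequencies f_1, f_2, f_3):

\begin{equation} v_{\text{word}} = \alpha_1 v_{\text{sense}_1} + \alpha_2 v_{\text{sense}_2} + \alpha_3 v_{\text{sense}_3}, \quad \alpha_i = \frac{f_i}{f_1 + f_2 + f_3} \end{equation}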

SU-CS224N APR112024

Last edited: August 8, 2025

Linguistic Structure

Humans somehow turn a linear sequence of words into complex meaning built from bigger, non-linear units. We need to make this structural complexity explicit. Sometimes the structure is even ambiguous.

We can use this to extract information from human languages.

Why is this hard?

  • coding: global clarity, local ambiguity (the number of white spaces doesn’t matter, but code always has one exact meaning)
  • speaking: global ambiguity, local clarity (words are always clearly said, but what they refer to may be unclear)

Prepositional Ambiguity

Why? A prepositional phrase does not have a clear attachment point, and the number of possible attachment structures grows exponentially with the number of phrases. For example, in “I saw the man with the telescope,” the phrase “with the telescope” can attach to the verb (I used the telescope) or to the noun (the man had it).
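
As a worked equation (a standard counting fact, not stated explicitly in the notes): with n prepositional phrases, the number of possible attachment structures is the Catalan number

\begin{equation} C_{n} = \frac{1}{n+1}\binom{2n}{n} \end{equation}

which grows roughly like \(4^{n} / n^{3/2}\).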

SU-CS224N APR162024

Last edited: August 8, 2025

Why Do Neural Nets Suddenly Work?

Regularization

see regularization

We want to be able to manipulate our parameters so that our models learn better; for instance, we want our weights to stay small, which we encourage by adding an L2 penalty to the loss:

\begin{equation} J_{reg}(\theta) = J(\theta) + \lambda \sum_{k}^{} \theta^{2}_{k} \end{equation}
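
A minimal numpy sketch of an L2-regularized gradient step (the learning rate, lambda, and stand-in gradient are illustrative assumptions):

import numpy as np

def sgd_step_l2(theta, grad_J, lr=0.1, lam=1e-4):
    # gradient of J_reg = gradient of J + 2 * lambda * theta
    return theta - lr * (grad_J + 2 * lam * theta)

theta = np.random.randn(5)
grad_J = np.random.randn(5)   # stand-in for a real loss gradient
theta = sgd_step_l2(theta, grad_J)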

or good ol’ dropout, a form of “feature-dependent regularization”

Motivation

  • classic view: regularization works to prevent overfitting when we have a lot of features relative to our data
  • NEW view with big models: regularization produces generalizable models when the parameter count is big enough

Dropout

Dropout: prevents feature co-adaptation => results in good regularization
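
A minimal numpy sketch of inverted dropout at training time (the keep probability and shapes are illustrative assumptions):

import numpy as np

def dropout(h, p_keep=0.8, train=True):
    if not train:
        return h  # no-op at test time (inverted dropout pre-scales)
    mask = np.random.rand(*h.shape) < p_keep  # randomly zero out units
    # scale by 1/p_keep so expected activations match test time
    return h * mask / p_keep

h = np.random.randn(4, 8)   # a hidden layer's activations
h_dropped = dropout(h)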

SU-CS224N APR182024

Last edited: August 8, 2025

perplexity

see perplexity

Vanishing Gradients

Consider how an RNN works: as your sequence gets longer, the earlier time steps receive very little gradient, because the chain rule multiplies the per-step gradients together.

Alternatively, if the per-step gradients are large, the parameter updates can blow up exponentially as well when your weights are too large (the product is either exponentially small or exponentially huge).
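
A minimal sketch of this effect, repeatedly multiplying a gradient by the same recurrent Jacobian (the matrix scales 0.5 vs. 1.5 and the step count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(16)

for scale in (0.5, 1.5):
    W = scale * np.eye(16)           # stand-in recurrent Jacobian
    grad = g.copy()
    for _ in range(20):              # 20 time steps of backprop
        grad = W.T @ grad
    print(scale, np.linalg.norm(grad))  # ~1e-6 vs. ~3e+3 times the start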

Why is this a problem?

To some extent, you can consider that we end up tuning the weights based on nearby tokens a lot more than on tokens from much earlier in the sequence. As a rough rule of thumb, a vanilla RNN has only about 7 tokens’ worth of effective conditioning.