SU-CS224N APR112024

Last edited: August 8, 2025

Linguistic Structure

Humans somehow turn a linear sequence of words into complex meaning built from bigger, non-linear units. We need to make this structural complexity explicit. Sometimes the structure is even ambiguous.

We can use this to extract information from human languages.

Why is this hard?

coding: global clarity, local ambiguity (the number of white spaces doesn't matter, but code always has one exact meaning)
speaking: global ambiguity, local clarity (words are always clearly said, but what they refer to may be unclear)

Prepositional Ambiguity

Why? A prepositional phrase does not have a clear attachment point. The number of possible attachments grows exponentially with the number of prepositional phrases.
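To get a feel for how fast this grows, here is a minimal sketch. It assumes the standard combinatorial analysis, in which the number of attachment structures follows the Catalan numbers \(C_n = \binom{2n}{n} / (n+1)\); the notes do not name this model, and how \(n\) maps to the number of phrases depends on the sentence frame. `catalan` is a hypothetical helper name.

```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# Growth of the number of candidate structures as n increases.
for n in range(1, 9):
    print(n, catalan(n))
```

The sequence begins 1, 2, 5, 14, 42, so a handful of stacked prepositional phrases already yields dozens of candidate parses.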

SU-CS224N APR162024

Last edited: August 8, 2025

Why do Neural Nets Work Suddenly?

Regularization

see regularization

We want to constrain our parameters so that our models learn better. For instance, we want our weights to be small in magnitude:

\begin{equation} J_{L2}(\theta) = J_{reg}(\theta) + \lambda \sum_{k} \theta_{k}^{2} \end{equation}

or good ol' dropout: "feature-dependent regularization"
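The L2 penalty in the equation above can be sketched as follows. This is an illustration, not a definitive implementation: the function name `l2_regularized_loss` and the example values are hypothetical, and `base_loss` plays the role of \(J_{reg}(\theta)\).

```python
import numpy as np

def l2_regularized_loss(base_loss: float, params: list[np.ndarray], lam: float) -> float:
    """Return base_loss + lam * sum_k theta_k^2, summed over all parameter arrays."""
    penalty = sum(float(np.sum(p ** 2)) for p in params)
    return base_loss + lam * penalty

# Hypothetical parameters: one weight matrix and one bias vector.
theta = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0])]
loss = l2_regularized_loss(base_loss=0.7, params=theta, lam=0.01)
print(loss)
```

Larger \(\lambda\) pushes the optimizer harder toward small weights, at the cost of fitting the training data less closely.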

Motivation

classic view: regularization works to prevent overfitting when we have a lot of features
NEW view with big models: regularization produces generalizable models when parameter count is big enough

Dropout

Dropout: prevents feature co-adaptation => results in good regularization
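A minimal sketch of dropout on a layer's activations. It assumes the common "inverted" variant, which rescales the surviving units at training time so that nothing changes at test time; the notes do not say which variant is meant, and the function name and shapes are illustrative.

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return x  # identity at evaluation time
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep  # True for units that survive
    return x * mask / keep             # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
h = np.ones((4, 1000))
h_train = dropout(h, p_drop=0.5, rng=rng)
h_eval = dropout(h, p_drop=0.5, rng=rng, training=False)
```

Because a different random subset of units is dropped on every step, no unit can rely on a specific other unit being present, which is the co-adaptation the note refers to.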

SU-CS224N APR182024

Last edited: August 8, 2025

perplexity

see perplexity

Vanishing Gradients

Consider how an RNN works: as your sequence gets longer, the earlier time steps receive very little gradient, because the gradient reaching them is a product of the gradients of every step in between.

Conversely, if your weights are too large, the gradient (and with it the parameter updates) can blow up exponentially. The gradient ends up either exponentially small or exponentially huge.
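The repeated multiplication can be sketched numerically. This simplified illustration assumes a linear recurrence \(h_t = W h_{t-1}\) with \(W\) equal to a scalar times an orthogonal matrix, so each backward step scales the gradient norm by exactly that scalar. A real RNN has nonlinearities and inputs; the function name is hypothetical.

```python
import numpy as np

def gradient_norm_through_time(scale: float, steps: int, dim: int = 8) -> float:
    """Norm of a unit gradient after backpropagating through `steps` linear recurrent steps."""
    rng = np.random.default_rng(0)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix
    w = scale * q
    grad = np.ones(dim) / np.sqrt(dim)  # unit-norm gradient at the last time step
    for _ in range(steps):
        grad = w.T @ grad               # one step of backpropagation through time
    return float(np.linalg.norm(grad))

small = gradient_norm_through_time(scale=0.5, steps=30)  # vanishes: about 0.5 ** 30
large = gradient_norm_through_time(scale=1.5, steps=30)  # explodes: about 1.5 ** 30
print(small, large)
```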

Why is this a problem?

To some extent this is reasonable: nearby tokens should influence the weights much more than tokens far earlier in the sequence. The problem is that the effect is too strong; as a rough rule of thumb, we get only about 7 tokens' worth of effective conditioning.

SU-CS224N APR232024

Last edited: August 8, 2025

Evaluating Machine Translation

BLEU

Compare the machine translation against one or more human reference translations. BLEU uses a geometric mean of n-gram precisions; the exact maximum n-gram size isn't super special.

The original idea was to have multiple reference translations, but in practice people often use only one reference translation; the score is still good in expectation.

Limitations

a good translation can get a bad BLEU score because it has low n-gram overlap with the reference
BLEU penalizes too-short system translations (i.e. translating only the easy sentences should not score well)
in practice you can't reach 100 BLEU, because valid translations vary in wording
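The BLEU computation described above can be sketched at the sentence level as follows. This is a simplified illustration, not a reference implementation: it assumes whitespace tokenization, n-grams up to 4, uniform weights, and no smoothing, and it returns a score in [0, 1] (reported BLEU is usually this value times 100). The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Clipped n-gram precisions, combined by geometric mean, times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat is on the mat".split()
exact = bleu(ref, [ref])
short = bleu("the cat is on".split(), [ref])
paraphrase = bleu("a feline sits upon the rug".split(), [ref])
print(exact, short, paraphrase)
```

The truncated candidate has perfect n-gram precision but is scaled down by the brevity penalty, and the paraphrase scores zero despite being a plausible translation, which illustrates the first two limitations listed above.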

attention

Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, with weights that depend on the query.
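A minimal sketch of this definition. The notes do not specify how the weights are computed; the example assumes dot-product scores between the query and each value, normalized with a softmax. The names and toy numbers are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, values):
    """Dot-product attention: weights = softmax(values @ query), output = weights @ values."""
    scores = values @ query           # one score per value vector, shape (n,)
    weights = softmax(scores)         # non-negative and sums to 1
    return weights @ values, weights  # weighted sum of the values, shape (d,)

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([10.0, 0.0])
output, weights = attention(query, values)
print(weights, output)
```

Here the query aligns with the first and third values, so they share almost all of the weight and the output is close to their average.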

SU-CS224N APR252024

Last edited: August 8, 2025

Transformers

Motivation

Lower Sequence-Length Time Complexity

Minimize Linear Interaction Distance

In a recurrent model, the interaction distance between two words scales as \(O(l)\), where \(l\) is the sequence length. Gradients are affected by this linear interaction distance, and linear order is baked in.

Maximize Parallelization

Forward and backward passes in a recurrent model require waiting for the computation to roll from left to right. Attention, by contrast, can be computed in parallel.

Key Advantage

Maximum interaction distance is \(O(1)\): each word is connected to every other word
The number of unparallelizable operations does not grow with sequence length

Self-Attention

Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
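A sketch of self-attention over a whole sequence at once. It assumes the usual scaled dot-product form with learned query, key, and value projections, which the notes do not spell out; the matrices `w_q`, `w_k`, `w_v` are random stand-ins for learned parameters.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (seq_len, seq_len): all pairs at once
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 4
x = rng.standard_normal((seq_len, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Every row of `weights` comes out of the same matrix product, so each position attends to every other position in a single parallel step rather than through a left-to-right chain.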