
SU-CS224N APR042024


stochastic gradient descent

See stochastic gradient descent

Word2Vec

see word2vec

Or, we can even use a simpler approach: window-based co-occurrence.
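
As a rough illustration, here is a minimal sketch of window-based co-occurrence counting; the toy corpus and window size are made up for illustration:

#+begin_src python
# Minimal sketch: symmetric window-based co-occurrence counts on a toy corpus.
from collections import defaultdict

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # symmetric context window size (illustrative)

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][sentence[j]] += 1

print(counts["i"]["like"])  # "i" and "like" co-occur twice
#+end_src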

GloVe

  • goal: correctly capture linear meaning components in a word vector space
  • insight: ratios of co-occurrence probabilities encode linear meaning components

Therefore, GloVe vectors come from a log-bilinear model:

\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}

such that:

\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
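
A minimal numpy sketch of a GloVe-style training step, i.e. weighted least squares on the log-bilinear relation above. The co-occurrence matrix, dimensions, and hyperparameters are toy values, and the bias terms and weighting function f come from the full GloVe objective rather than the equations above:

#+begin_src python
# Sketch: one pass of SGD on a GloVe-style objective
#   J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
# X, dimensions, and hyperparameters are toy values.
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 5, 8, 0.05                                 # vocab size, embedding dim, learning rate
X = rng.integers(1, 10, size=(V, V)).astype(float)    # toy co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))                # word vectors
W_t = rng.normal(scale=0.1, size=(V, d))              # context ("tilde") vectors
b, b_t = np.zeros(V), np.zeros(V)                     # biases

def f(x, x_max=100.0, alpha=0.75):                    # down-weights rare/very frequent pairs
    return np.minimum((x / x_max) ** alpha, 1.0)

for i in range(V):
    for j in range(V):
        diff = W[i] @ W_t[j] + b[i] + b_t[j] - np.log(X[i, j])
        g = 2.0 * f(X[i, j]) * diff                   # d J_ij / d diff
        grad_wi, grad_wj = g * W_t[j], g * W[i]
        W[i] -= lr * grad_wi
        W_t[j] -= lr * grad_wj
        b[i] -= lr * g
        b_t[j] -= lr * g
#+end_src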

Evaluating an NLP System

Intrinsic

  • evaluate on a specific, intermediate subtask (e.g. word vector analogies)
  • fast to compute
  • helps us understand the component being evaluated

Extrinsic

  • evaluate on a real task, e.g. by attempting to replace an older system with the new system
  • may be slow or expensive to compute

Word Sense Ambiguity

Each word may have multiple different meanings; each of those separate word senses should live in a different place in the vector space. However, polysemous words have related senses, so we usually average them into a single vector.
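
One plausible form of that average (assuming, as in the superposition view of polysemy, weights proportional to the sense frequencies $f_i$; the symbols here are illustrative):

\begin{equation} v_{\text{word}} = \sum_{i} \alpha_{i}\, v_{\text{sense}_i}, \qquad \alpha_{i} = \frac{f_{i}}{\sum_{j} f_{j}} \end{equation}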

SU-CS224N APR112024


Linguistic Structure

Humans somehow turn a linear sequence of words into complex meaning by composing bigger, non-linear units. We need to make this structural complexity explicit. Sometimes the structure is even ambiguous.

We can use this to extract information from human languages.

Why is this hard?

  • coding: global clarity, local ambiguity (the number of white spaces doesn’t matter, but code always has exactly one meaning)
  • speaking: global ambiguity, local clarity (words are always clearly said, but what they refer to may be unclear)

Prepositional Ambiguity

Why? A prepositional phrase does not have a single clear attachment point, and the number of possible attachments grows exponentially with the number of phrases.
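
A small sketch of how fast this grows, assuming (as is standard for PP attachment) that the number of ways to attach k prepositional phrases follows the Catalan numbers; the helper below is just for illustration:

#+begin_src python
# Number of ways to attach k prepositional phrases, assuming Catalan-number growth:
#   C_k = (2k choose k) / (k + 1), which grows exponentially in k.
from math import comb

def catalan(k: int) -> int:
    return comb(2 * k, k) // (k + 1)

print([catalan(k) for k in range(1, 8)])  # [1, 2, 5, 14, 42, 132, 429]
#+end_src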

SU-CS224N APR162024


Why do Neural Nets Suddenly Work?

Regularization

see regularization

We want to be able to manipulate our parameters so that our models generalize better; for instance, we want our weights to be small:

\begin{equation} J_{reg}(\theta) = J(\theta) + \lambda \sum_{k} \theta^{2}_{k} \end{equation}
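
A minimal numpy sketch of that L2 penalty; the loss value, parameters, and λ below are toy numbers:

#+begin_src python
# Sketch: L2-regularized loss J_reg = J + lambda * sum_k theta_k^2 (toy values).
import numpy as np

def l2_regularized_loss(loss, theta, lam=1e-4):
    return loss + lam * np.sum(theta ** 2)

theta = np.array([0.5, -1.2, 3.0])
print(l2_regularized_loss(loss=2.0, theta=theta))  # 2.0 + 1e-4 * (0.25 + 1.44 + 9.0)
#+end_src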

Another option is good ol’ dropout, a kind of “feature-dependent regularization”.

Motivation

  • classic view: regularization works to prevent overfitting when we have a lot of features
  • NEW view with big models: regularization produces generalizable models when parameter count is big enough

Dropout

Dropout: prevents feature co-adaptation => results in good regularization
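
A minimal sketch of inverted dropout (the common formulation, which rescales at train time so no change is needed at test time); the drop probability and input are toy values:

#+begin_src python
# Sketch: inverted dropout -- randomly zero activations at train time and
# rescale by 1/(1-p) so the expected activation matches test time.
import numpy as np

def dropout(h, p=0.5, train=True):
    if not train:
        return h                                   # identity at test time
    mask = (np.random.rand(*h.shape) > p).astype(h.dtype)
    return h * mask / (1.0 - p)                    # rescale to preserve the expectation

h = np.ones((2, 4))
print(dropout(h, p=0.5))
#+end_src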

SU-CS224N APR182024


perplexity

see perplexity

Vanishing Gradients

Consider how an RNN works: as your sequence gets longer, the earlier layers get very little gradient, because you have to multiply the gradients of all the later layers together.

Alternatively, if the gradient is very large, the parameter updates can blow up exponentially as well when your weights are too large (it’s either exponentially small or exponentially huge).
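
A small numpy sketch of that effect: backpropagating through T repeated linear steps scales the gradient roughly like (spectral radius of W)^T, so it vanishes when the recurrent weights are small and explodes when they are large. The matrices and sequence length are toy values, and the nonlinearity is ignored:

#+begin_src python
# Sketch: gradient norm after backprop through T repeated time steps (no nonlinearity).
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50
for scale in (0.5, 1.5):                               # "small" vs "large" recurrent weights
    W = scale * rng.normal(size=(d, d)) / np.sqrt(d)   # spectral radius roughly equal to scale
    grad = np.ones(d)
    for _ in range(T):
        grad = W.T @ grad                              # one step of backprop through time
    print(scale, np.linalg.norm(grad))                 # tiny for 0.5, huge for 1.5
#+end_src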

Why is this a problem?

To some extent, this means we end up tuning the nearby weights a lot more than the weights for tokens much earlier in the sequence. As a rough rule of thumb, a vanilla RNN gets only about 7 tokens’ worth of effective conditioning.

SU-CS224N APR232024


Evaluating Machine Translation

BLEU

Compare the machine translation against multiple human reference translations. BLEU uses a geometric mean of n-gram precisions; the exact maximum n-gram size isn’t particularly important.

The original idea was to use multiple reference translations, but in practice people often use only one reference translation, which still gives a good score in expectation.
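
A simplified sketch of the computation (single reference, no smoothing; the example sentences are made up):

#+begin_src python
# Simplified BLEU sketch: geometric mean of clipped n-gram precisions times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())  # clipped counts
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the red mat".split()
print(round(bleu(hyp, ref), 3))  # roughly 0.67
#+end_src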

Limitations

  • a good translation can get a bad BLEU score because it has low n-gram overlap with the references
  • there is a penalty for too-short system translations (i.e. translating only the easy sentences doesn’t give a good score)
  • you really can’t get to 100 BLEU because of natural variation in valid translations

attention

Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, where the weights depend on the query.
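
A minimal numpy sketch of dot-product attention. The separate key vectors and the 1/√d scaling are common choices rather than anything stated above (in the most basic formulation the keys are simply the values); shapes and inputs are toy values:

#+begin_src python
# Sketch: dot-product attention -- softmax(query . keys) gives weights over the values.
import numpy as np

def attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[-1])  # similarity of the query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax -> attention distribution
    return weights @ values                           # weighted sum of the values

d = 4
keys = np.random.randn(5, d)                          # 5 positions
values = np.random.randn(5, d)
query = np.random.randn(d)
print(attention(query, keys, values).shape)           # (4,)
#+end_src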