SU-CS224N APR042024
stochastic gradient descent
see stochastic gradient descent
Word2Vec
see word2vec
Or, we can use an even simpler approach: window-based co-occurrence counts.
GloVe
- goal: capture linear meaning components in a word vector space
- insight: ratios of co-occurrence probabilities encode linear meaning components
Therefore, GloVe vectors come from a log-bilinear model:
\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}
such that:
\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
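This relation is the idealized form of the full GloVe objective (Pennington et al. 2014), a weighted least squares over the co-occurrence matrix. A minimal sketch, with illustrative names (`W` and `W_tilde` are the two sets of word vectors, `X` the co-occurrence counts):

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Each observed pair (i, j) contributes f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    so trained vectors satisfy w_i . w~_j ~= log X_ij (the relation above, up to biases)."""
    i, j = np.nonzero(X)                       # only observed co-occurrences
    x = X[i, j]
    f = np.minimum((x / x_max) ** alpha, 1.0)  # weighting caps very frequent pairs
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(f * (pred - np.log(x)) ** 2)

# Toy usage: a 5-word vocabulary with 10-dimensional vectors.
rng = np.random.default_rng(0)
V, d = 5, 10
X = rng.integers(0, 20, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```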
Evaluating an NLP System
Intrinsic
- evaluate on the specific, intermediate subtask the system is trained on
- fast to compute
- helps us understand the subsystem in isolation
Extrinsic
- evaluate on a real task, attempting to replace an older subsystem with the new one
- may be expensive to compute
Word Sense Ambiguity
Each word may have multiple different meanings; each separate word sense should live in a different place in the vector space. However, polysemous words have related senses, so we usually average the sense vectors:
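For example, one common scheme weights each sense vector \(v_{w_{i}}\) by its relative frequency \(f_{i}\) (notation mine):
\begin{equation} v_{w} = \sum_{i}^{} \frac{f_{i}}{\sum_{j}^{} f_{j}} v_{w_{i}} \end{equation}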
SU-CS224N APR092024
Neural Networks are powerful because of the self-organization of their intermediate layers.
Neural Network Layer
\begin{equation} z = Wx + b \end{equation}
for the output, and the activations:
\begin{equation} a = f(z) \end{equation}
where the activation function \(f\) is applied element-wise.
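As a minimal sketch of one such layer (the variable names and the choice of ReLU are illustrative, not from the lecture):

```python
import numpy as np

def layer_forward(x, W, b, f=lambda z: np.maximum(z, 0.0)):
    """One layer: affine transform z = Wx + b, then element-wise activation a = f(z)."""
    z = W @ x + b
    return f(z)

# Toy usage: map a 4-dim input to 3 hidden units through a ReLU.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
print(layer_forward(x, W, b))
```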
Why are NNs Non-Linear?
- stacking multiple linear layers adds no representational power, since a composition of linear maps is itself linear (though big linear networks can still have better learning/convergence properties!)
- most things are non-linear!
Activation Function
We want non-linear, non-threshold (0/1) activation functions: they have a usable slope, so we can perform gradient-based learning.
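A small illustration (my own) of why the slope matters: a hard threshold has zero derivative almost everywhere, so no gradient flows through it, while sigmoid and ReLU pass useful gradients.

```python
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))  # nonzero everywhere
def d_relu(z):    return (z > 0).astype(float)            # nonzero for z > 0
def d_step(z):    return np.zeros_like(z)                 # 0/1 threshold: slope 0 a.e.

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(d_sigmoid(z), d_relu(z), d_step(z))
```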
SU-CS224N APR112024
Linguistic Structure
Humans somehow turn a linear sequence of words into complex meaning built out of bigger, non-linear units. We need to make this structural complexity explicit. Sometimes, the structure is even ambiguous.
We can use this to extract information from human languages.
Why is this hard?
- coding: global clarity, local ambiguity (the number of white spaces doesn’t matter, but code always has one exact meaning)
- speaking: global ambiguity, local clarity (words are always clearly said, but what they refer to may be unclear)
Prepositional Ambiguity
Why? A prepositional phrase does not have a clear attachment point, and the number of possible attachment structures grows exponentially with the number of phrases.
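Concretely, the number of binary attachment structures over \(n\) units is the Catalan number \(C_{n}\), which grows exponentially. A quick sketch (my own illustration):

```python
from math import comb

def catalan(n):
    """n-th Catalan number: counts the possible binary attachment structures."""
    return comb(2 * n, n) // (n + 1)

for n in range(1, 9):
    print(n, catalan(n))  # 1, 2, 5, 14, 42, 132, 429, 1430
```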
SU-CS224N APR162024
Why do Neural Nets Work Suddenly?
Regularization
see regularization
We want to be able to manipulate our parameters so that our models learn better; for instance, we want our weights to stay small:
\begin{equation} J_{reg}(\theta) = J(\theta) + \lambda \sum_{k}^{} \theta^{2}_{k} \end{equation}
or good ol’ dropout, a “feature-dependent regularization”.
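A minimal sketch of that L2-penalized objective (names like `data_loss` and `lam` are mine):

```python
import numpy as np

# J_reg(theta) = J(theta) + lambda * sum_k theta_k^2
def l2_regularized_loss(data_loss, theta, lam=1e-4):
    return data_loss + lam * np.sum(theta ** 2)

theta = np.array([0.5, -2.0, 3.0])
print(l2_regularized_loss(1.0, theta))  # 1.0 + 1e-4 * (0.25 + 4.0 + 9.0)
```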
Motivation
- classic view: regularization works to prevent overfitting when we have a lot of features
- NEW view with big models: regularization produces generalizable models when parameter count is big enough
Dropout
Dropout: prevents feature co-adaptation (no unit can rely on specific other units always being present) => results in good regularization
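A sketch of (inverted) dropout, as I understand the standard recipe: at train time, zero each unit with probability \(p\) and rescale the survivors by \(1/(1-p)\) so expected activations are unchanged; at test time, pass activations through untouched.

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=np.random.default_rng(0)):
    if not train:
        return a                                   # test time: no-op
    mask = (rng.random(a.shape) >= p) / (1.0 - p)  # drop ~p of units, rescale rest
    return a * mask

a = np.ones(8)
print(dropout(a))                # roughly half zeroed, survivors scaled to 2.0
print(dropout(a, train=False))   # unchanged
```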
SU-CS224N APR182024
perplexity
see perplexity
Vanishing Gradients
Consider how an RNN works: as your sequence gets longer, the earlier timesteps get very little gradient, because the gradient reaching them is the product of every intermediate step’s local gradient, and repeatedly multiplying small factors shrinks it exponentially.
Alternatively, if the per-step gradients are large (your weights are too big), the parameter updates can blow up exponentially instead. The gradient is either exponentially small or exponentially huge.
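A toy illustration of this (mine, not from the lecture): backprop through \(T\) steps multiplies \(T\) per-step Jacobians, so a contractive weight matrix makes the gradient vanish while an expansive one makes it explode.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 50
for scale in (0.3, 1.5):                              # contractive vs. expansive
    W = scale * rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in per-step Jacobian
    g = np.ones(d)                                    # gradient at the final step
    for _ in range(T):
        g = W.T @ g                                   # one chain-rule step backward
    print(scale, np.linalg.norm(g))                   # ~0 for 0.3, huge for 1.5
```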
Why is this a problem?
To some extent, the gradient is telling us to tune the weights for nearby effects much more than for effects from far earlier in the sequence, so long-range dependencies go unlearned. As a ham-fisted estimate, we get roughly 7 tokens’ worth of effective conditioning.