Houjun Liu


We will train a classifier on a binary prediction task: “are the context words \(c_{1:L}\) likely to show up near some target word \(w_{0}\)?”

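We estimate the probability that the context words occur within the window around \(w_{0}\) as a product over the context words: each factor is the probability for one (context word, target word) pair, computed from the similarity of their embeddings. This treats the context words as independent of each other.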

To turn the dot product (an unnormalized similarity score, closely related to cosine similarity) into a probability, we squash it with the sigmoid function \(\sigma(x) = \frac{1}{1+e^{-x}}\).

importantly, we don’t actually use the classifier’s predictions; we simply keep the learned embeddings.
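The probability estimate described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names (`window_probability`, `window_log_probability`) and the random toy vectors are made up here, and the embeddings are untrained.

```python
import numpy as np

def sigmoid(x):
    """Map any real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def window_probability(w, contexts):
    """P(+ | w, c_1..c_L) = prod_i sigmoid(c_i . w).

    w:        target embedding, shape (d,)
    contexts: context embeddings, one per row, shape (L, d)
    Assumes the context words are independent of each other.
    """
    return float(np.prod(sigmoid(contexts @ w)))

def window_log_probability(w, contexts):
    """Same quantity in log space, which avoids underflow for long windows."""
    return float(np.sum(np.log(sigmoid(contexts @ w))))

# Hypothetical toy data: random, untrained vectors.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
contexts = rng.normal(size=(3, 4))
p = window_probability(w, contexts)
```

In practice the log form is usually the one to optimize, since a product of many probabilities quickly becomes very small.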


window size

  • smaller windows: capture more syntax-level information
  • larger windows: capture more semantic field information

parallelogram model

simple way to solve analogy problems with vector semantics: take the difference between two word vectors, and add it to a third word vector to get the analogous transformation.

  • only works for frequent words
  • small distances
  • but not quite for large systems
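A minimal sketch of the parallelogram method, assuming hand-made 3-dimensional toy vectors rather than trained embeddings. The helper names (`cosine`, `solve_analogy`) and the vocabulary are hypothetical. The input words are excluded from the candidates, since otherwise one of them is often the nearest neighbor of the computed point.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the three input words from the candidate set.
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))

# Hypothetical toy vectors, constructed so the offsets line up exactly.
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, -1.0]),
}
answer = solve_analogy("man", "woman", "king", toy)  # "queen" for these toy vectors
```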

allocational harm

embeddings bake in existing biases, which leads to bias in hiring practices, etc.

skip-gram with negative sampling

skip-gram trains two separate vectors for each word: one for when the word is used as a target, and one for when it is used as context.

the mechanism for training the embedding:

  • select some \(k\), which is the multiplier of the negative examples (if \(k=2\), every positive example will be matched with 2 negative examples)
  • sample a target word, and generate positive samples paired by words in its immediate window
  • sample window size times \(k\) negative examples, where the noise words are chosen explicitly as not being near our target word, and weighted based on unigram frequency
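The sampling steps above can be sketched as follows. This is a sketch under assumptions, not a reference implementation: `make_training_pairs` and its parameters are hypothetical names, and the `alpha` exponent on the unigram counts is an added knob (`alpha=1.0` gives plain unigram weighting as in the notes; word2vec implementations commonly use 0.75).

```python
import random
from collections import Counter

def make_training_pairs(tokens, position, window=2, k=2, alpha=1.0, seed=0):
    """Build positive and negative (target, context) pairs for one target word."""
    rng = random.Random(seed)
    target = tokens[position]

    # Positive examples: every word inside the window around the target.
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    context = [tokens[i] for i in range(lo, hi) if i != position]
    positives = [(target, c) for c in context]

    # Noise candidates: words that are neither the target nor in its window.
    counts = Counter(tokens)
    excluded = set(context) | {target}
    noise_words = [word for word in counts if word not in excluded]
    weights = [counts[word] ** alpha for word in noise_words]

    # k negatives per positive, sampled in proportion to (weighted) frequency.
    sampled = rng.choices(noise_words, weights=weights, k=k * len(positives))
    negatives = [(target, n) for n in sampled]
    return positives, negatives

tokens = "a tablespoon of apricot jam and a pinch of salt for the jam".split()
positives, negatives = make_training_pairs(tokens, position=3, window=2, k=2)
```

Here the target is "apricot", which yields 4 positive pairs and 8 negative pairs.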

for each positive pair together with its \(k\) negative pairs, we minimize the cross-entropy loss:

\begin{equation} L_{CE} = -\qty[ \log (\sigma(c_{pos} \cdot w)) + \sum_{i=1}^{k} \log \sigma\qty(-c_{neg_{i}} \cdot w)] \end{equation}

recall that the gradient with respect to the target embedding \(w\) is:

\begin{equation} \pdv{L_{CE}}{w} = \qty[\sigma(c_{pos} \cdot w) -1]c_{pos} + \sum_{i=1}^{k} \qty[\sigma(c_{neg_{i}}\cdot w)]c_{neg_{i}} \end{equation}
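As a sanity check, the loss and its gradient with respect to \(w\) can be written in NumPy and compared against a finite-difference estimate. This is a sketch on random toy vectors; the function names (`sgns_loss`, `sgns_grad_w`, `numerical_grad`) and the learning rate `eta` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    """L_CE = -[log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w)]."""
    pos_term = np.log(sigmoid(c_pos @ w))
    neg_term = np.sum(np.log(sigmoid(-(c_negs @ w))))
    return float(-(pos_term + neg_term))

def sgns_grad_w(w, c_pos, c_negs):
    """dL/dw = [sigma(c_pos . w) - 1] c_pos + sum_i sigma(c_neg_i . w) c_neg_i."""
    return (sigmoid(c_pos @ w) - 1.0) * c_pos + sigmoid(c_negs @ w) @ c_negs

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Hypothetical toy data: one positive context and k = 2 negative contexts.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
c_pos = rng.normal(size=5)
c_negs = rng.normal(size=(2, 5))

analytic = sgns_grad_w(w, c_pos, c_negs)
numeric = numerical_grad(lambda v: sgns_loss(v, c_pos, c_negs), w)
max_gap = float(np.max(np.abs(analytic - numeric)))

# One gradient-descent step on the target embedding (eta is an assumed value).
eta = 0.01
w_new = w - eta * analytic
```

The context vectors \(c_{pos}\) and \(c_{neg_{i}}\) receive their own analogous updates during training; only the update to \(w\) is shown here.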