Houjun Liu

word2vec

We will train a classifier on a binary prediction task: “are the context words \(c_{1:L}\) likely to show up near some target word \(w_{0}\)?”

We estimate the probability that \(w_{0}\) occurs within this window from the product, over context words, of probabilities derived from the similarity between each context word's embedding and the target word's embedding.

  • we have a corpus of text
  • each word is represented by a vector
  • go through each position \(t\) in the text, which has a center word \(c\) and a set of context words \(o \in O\)
  • use similarity of word vectors \(c\) and \(o\) to calculate \(P(o|c)\)

Meaning, we want to devise a model which predicts high probabilities \(P(w_{t-n}|w_{t})\) for small \(n\) and low probabilities for large \(n\).
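For concreteness, here is a minimal sketch (the toy corpus and window size are made up, and this is not the full training pipeline) of walking each position \(t\) and collecting the (center, context) pairs the classifier is trained on:

```python
# Minimal sketch: enumerate (center, context) training pairs
# from a toy corpus with an assumed window size m.
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # window size (made up for illustration)

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        pairs.append((center, corpus[t + j]))  # (w_t, w_{t+j})

print(pairs[:5])
```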

Word2Vec is a Bag of Words model!

This is a Bag of Words model because training learns nothing about the ordering or structure of the words within the window.

Likelihood

If we wrote the above out:

\begin{equation} L(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j\neq 0}^{} p_{\theta}(w_{t+j} | w_{t}) \end{equation}
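As a rough sketch of this double product (assuming some probability function `p(o, c)` that returns \(p_{\theta}(w_{t+j} \mid w_{t})\); the actual model for it is defined in the next section):

```python
# Sketch of L(theta): product over positions and window offsets of
# p_theta(w_{t+j} | w_t); `p` is a stand-in for the model defined below.
def likelihood(corpus, p, m=2):
    L = 1.0
    for t in range(len(corpus)):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            L *= p(corpus[t + j], corpus[t])
    return L

# e.g., with a dummy model assigning uniform probability over 9 words:
print(likelihood("the quick brown fox".split(), lambda o, c: 1 / 9))
```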

Calculating \(p_{\theta}\)

We are going to use TWO VECTORS for each word:

  • \(v_{w}\) when \(w\) is the center word
  • and \(u_{w}\) when \(w\) is a context word

These vectors are the only parameters of our system. We use two vectors per word only to make the math easier; the final “word vector” for a word is obtained by averaging the two.

Therefore:

\begin{equation} p(o|c) = \frac{\exp\qty(u_{o} \cdot v_{c})}{ \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}

  • exponentiation makes anything positive
  • normalize over the entire vocabulary

This is a softmax operation.
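A sketch of this computation in NumPy (vocabulary size, dimension, and the random vectors are all made up; `U` stacks the context vectors \(u_{w}\) and `V` the center vectors \(v_{w}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 4
U = rng.normal(size=(vocab_size, d))  # u_w: word used as context
V = rng.normal(size=(vocab_size, d))  # v_w: word used as center

def p_o_given_c(o, c):
    scores = U @ V[c]        # u_w . v_c for every w in the vocabulary
    scores -= scores.max()   # for numerical stability
    exp = np.exp(scores)
    return exp[o] / exp.sum()  # softmax, then pick entry o

print(p_o_given_c(o=2, c=0))
# the final "word vector" for word w is the average (U[w] + V[w]) / 2
```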

Objective Function

But to get an objective we can actually optimize, we:

  1. negate it (so we can perform gradient descent)
  2. take the log of each value to prevent underflow
  3. and average over the corpus (why not)

\begin{equation} J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \log p_{\theta}\qty(w_{t+j} | w_{t}) \end{equation}
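A sketch of evaluating this objective (assuming `pairs` holds integer (center, context) word ids obtained by mapping words through a vocabulary index; the average here is over pairs rather than positions, for simplicity):

```python
import numpy as np

# Sketch of J(theta): negative average log-probability of observed
# (center, context) id pairs under the softmax model above.
def objective(pairs, U, V):
    total = 0.0
    for c, o in pairs:
        scores = U @ V[c]
        scores -= scores.max()
        log_probs = scores - np.log(np.exp(scores).sum())  # log-softmax
        total += log_probs[o]
    return -total / len(pairs)  # averaged over pairs for simplicity
```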

Recall that:

\begin{equation} p(o|c) = \frac{\exp\qty(u_{o} \cdot v_{c})}{ \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}

Because we need to minimize this, we need its derivative with respect to the parameters:

\begin{align} \pdv{J(\theta)}{\theta} &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \pdv{}{\theta} \log p_{\theta}\qty(w_{t+j} | w_{t}) \end{align}

meaning, we can now just calculate the inner part (here, with respect to the center vector \(v_{c}\)):

\begin{equation} \pdv{\log p(o|c)}{v_{c}} = \pdv{}{v_{c}} \log \exp \qty(u_{o} \cdot v_{c}) - \pdv{}{v_{c}} \log \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c}) \end{equation}

Look! The first part is a log of an exp, which cancels out, so the derivative is just \(u_{o}\).

For the right part, by the chain rule:

\begin{align} \pdv{}{v_{c}} \log \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c}) &= \frac{\sum_{x \in V}^{} \pdv{}{v_{c}}\exp \qty(u_{x} \cdot v_{c}) }{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \\ &= \frac{\sum_{x \in V}^{}\exp \qty(u_{x} \cdot v_{c})\, u_{x}}{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{align}

Combining this whole thing, we have:

\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \frac{\sum_{x \in V}^{}\exp \qty(u_{x} \cdot v_{c})\, u_{x}}{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}

Rewriting this slightly:

\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \sum_{x \in V}^{}\frac{\exp \qty(u_{x} \cdot v_{c}) }{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})}\, u_{x} \end{equation}

Meaning:

\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \sum_{x \in V}^{} P(x|c)\, u_{x} \end{equation}

The right side is just the softmax probability \(P(x|c)\) of each word times its \(u_{x}\), meaning it is \(\mathbb{E}[u_{x}]\); so this loss just minimizes the “error between the observed context vector and its expectation”.
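A sketch of this gradient in NumPy (using the same assumed `U`, `V` matrices as the earlier sketches): the first term is \(u_{o}\) and the second is the probability-weighted average of all context vectors.

```python
import numpy as np

# d log p(o|c) / d v_c  =  u_o - E_{x ~ P(x|c)}[u_x]
def grad_log_p_wrt_vc(o, c, U, V):
    scores = U @ V[c]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax P(x|c)
    return U[o] - probs @ U                        # u_o minus expected u_x
```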

Word2Vec Variants

Model

  • skip-gram: predict the probability of the context (“side”) words given the center word, \(P(o|c)\)
  • CBOW: predict the probability of the center word given the context words

Objective

properties

window size

  • smaller windows: capture more syntax-level information
  • larger windows: capture more semantic-field information

parallelogram model

a simple way to solve analogy problems with vector semantics: take the difference between two word vectors, and add it to a third vector to get an analogous transformation (a sketch follows the list below).

  • only works for frequent words
  • and for small distances
  • but does not hold up well for larger systems
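A sketch of the parallelogram method (the `emb` dictionary of trained word vectors is hypothetical): solve \(a : b :: c : ?\) by computing \(b - a + c\) and returning the nearest neighbor by cosine similarity.

```python
import numpy as np

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? via the parallelogram method, e.g.
    analogy("man", "king", "woman", emb) should land near "queen"."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(emb[w], target))
```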

allocational harm

embeddings bake in existing biases, which leads to bias in hiring practices, etc.

skip-gram with negative sampling

skip-gram trains separate vectors for each word: one for when it is used as the target, and one for when it is used as context.

the mechanism for training the embedding:

  • select some \(k\), the number of negative examples per positive example (if \(k=2\), each positive example is matched with 2 negative examples)
  • sample a target word, and generate positive samples paired with the words in its immediate window
  • sample window-size times \(k\) negative examples, where the noise words are chosen explicitly as not being near our target word, weighted by unigram frequency (see the sketch after this list)
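A sketch of that sampling procedure (integer word ids, a precomputed noise distribution `noise_probs`, and \(k\) are all assumed; for simplicity this does not explicitly reject noise words that happen to fall inside the window):

```python
import numpy as np

def make_samples(corpus_ids, t, m, k, noise_probs, rng):
    """Positive pairs from the window around position t, each matched
    with k negative (noise) word ids drawn from noise_probs."""
    target = corpus_ids[t]
    positives = [corpus_ids[t + j] for j in range(-m, m + 1)
                 if j != 0 and 0 <= t + j < len(corpus_ids)]
    samples = []
    for pos in positives:
        negs = rng.choice(len(noise_probs), size=k, p=noise_probs)
        samples.append((target, pos, list(negs)))
    return samples
```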

for each paired training sample, we minimize the binary cross-entropy loss:

\begin{equation} L_{CE} = -\qty[ \log \sigma\qty(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma\qty(-c_{neg_{i}} \cdot w)] \end{equation}

recall that:

\begin{equation} \pdv{L_{CE}}{w} = \qty[\sigma(c_{pos} \cdot w) -1]c_{pos} + \sum_{i=1}^{k} \qty[\sigma(c_{neg_{i}}\cdot w)]c_{neg_{i}} \end{equation}

Importantly, because the sigmoid function satisfies \(\sigma(-x) = 1-\sigma(x)\), our objective can equivalently be written as:

\begin{equation} L_{CE} = -\qty[ \log \sigma\qty(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \qty(1 - \sigma\qty(c_{neg_{i}} \cdot w))] \end{equation}
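A sketch of the per-sample loss and the corresponding SGD updates implied by the gradient recalled above (`W` holds target-word vectors, `C` holds context-word vectors; the learning rate is made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, target, pos, negs, lr=0.025):
    """One stochastic update for a (target, positive, negatives) sample."""
    w = W[target]
    loss = -(np.log(sigmoid(C[pos] @ w))
             + sum(np.log(sigmoid(-C[n] @ w)) for n in negs))
    # dL/dw, per the gradient above
    grad_w = (sigmoid(C[pos] @ w) - 1) * C[pos] \
             + sum(sigmoid(C[n] @ w) * C[n] for n in negs)
    # dL/dc_pos and dL/dc_neg_i
    C[pos] -= lr * (sigmoid(C[pos] @ w) - 1) * w
    for n in negs:
        C[n] -= lr * sigmoid(C[n] @ w) * w
    W[target] -= lr * grad_w
    return loss
```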

how to sample the \(k\) negative examples

We actually sample from:

\begin{equation} P(w) \sim U(w)^{3/4}/Z \end{equation}

to give the less common words slightly higher probability.
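A sketch of building that noise distribution from raw unigram counts (the counts are made up):

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # made-up unigram counts
alpha = 0.75
noise_probs = counts**alpha / (counts**alpha).sum()
print(noise_probs)  # rarer words get more mass than under raw frequency
```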