- 1990 static word embeddings
- 2003 neural language models
- 2008 multi-task learning
- 2015 attention
- 2017 transformer
- 2018 trainable contextual word embeddings + large scale pretraining
- 2019 prompt engineering

## Motivating Attention

Given a sequence of embeddings: \(x_1, x_2, …, x_{n}\)

For each \(x_{i}\), the goal of attention is to **produce a new embedding** \(a_{i}\) based on the dot-product similarity of \(x_{i}\) with all words that come before it.

Let’s define:

\begin{equation} score(x_{i}, x_{j}) = x_{i} \cdot x_{j} \end{equation}

Which means that we can write:

\begin{equation} a_{i} = \sum_{j \leq i} \alpha_{i,j} x_{j} \end{equation}

where:

\begin{equation} \alpha_{i,j} = \operatorname{softmax}\left( score(x_{i}, x_{j}) \right) \end{equation}

with the softmax normalizing over all positions \(j \leq i\), so that the weights \(\alpha_{i,j}\) sum to one.

The resulting \(a_{i}\) is the output of our attention.
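The formulation above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation; the function name `simple_attention` and the use of an upper-triangular mask to enforce \(j \leq i\) are choices made here for clarity.

```python
import numpy as np

def simple_attention(x):
    """Causal dot-product attention over a sequence of embeddings.

    x: array of shape (n, d). Returns a of shape (n, d), where each a_i
    is a softmax-weighted sum of x_j for j <= i, weighted by x_i . x_j.
    """
    n, _ = x.shape
    scores = x @ x.T  # score(x_i, x_j) = x_i . x_j, shape (n, n)
    # Mask out future positions (j > i) so they get zero weight.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x  # a_i = sum_{j <= i} alpha_{i,j} x_j
```

Note that \(a_{1}\) can only attend to \(x_{1}\) itself, so the first output row equals the first input row.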

## Attention

From the above, we call the input embeddings \(x_{j}\) the **values**. We also create separate embeddings, called **keys**, against which we measure similarity, and we call the word we want the new target embedding for the **query** (i.e., \(x_{i}\) from above).
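A sketch of this query/key/value split, assuming learned projection matrices (here passed in as `w_q`, `w_k`, `w_v`, names chosen for illustration): each \(x_{i}\) is projected into a query, a key, and a value, and the dot-product scoring from before is applied between queries and keys instead of between raw embeddings.

```python
import numpy as np

def qkv_attention(x, w_q, w_k, w_v):
    """Causal attention with separate query/key/value projections.

    x: (n, d) embeddings; w_q, w_k, w_v: (d, d) projection matrices.
    Output a_i is a softmax-weighted sum of values v_j for j <= i,
    scored by q_i . k_j.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n = x.shape[0]
    scores = q @ k.T  # score(q_i, k_j) = q_i . k_j
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # only attend to j <= i
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v  # weighted sum of values
```

With identity matrices for all three projections, this reduces to the simpler formulation above, where queries, keys, and values are all the raw embeddings.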