## Transformers

### Motivation

#### Lower Sequence-Length Time Complexity

#### Minimize Linear Interaction Distance

In a recurrent model, the interaction distance between two tokens scales as \(O(l)\), with \(l\) the sequence length: the gradient between distant tokens has to flow across that whole linear distance, so the linear order of the sequence is baked in.

#### Maximize Parallelization

In a recurrent model, the forward and backward passes require waiting for the computation to roll from left to right across the sequence; attention, in contrast, can be computed for all positions in parallel.

### Key Advantage

- Maximum interaction distance is \(O(1)\): each word is directly connected to every other word
- The number of unparallelizable operations does not grow with sequence length

### Self-Attention

Self-attention is formulated as each word in a sequence attending to each word in the same sequence.

#### Calculating QKV

\begin{equation} \begin{cases} q_{i} = W^{(Q)} x_{i}\\ k_{i} = W^{(K)} x_{i}\\ v_{i} = W^{(V)} x_{i}\\ \end{cases} \end{equation}

and then compute reduced-rank multiplicative attention scores:

\begin{equation} e_{ij} = q_{i} \cdot k_{j} \end{equation}

and normalize:

\begin{equation} a_{ij} = \text{softmax}_{j} (e_{ij}) \end{equation}

to obtain:

\begin{equation} O_{i} = \sum_{j} a_{ij} v_{j} \end{equation}
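A minimal NumPy sketch of the per-token computation above, using row-vector conventions (so each projection is `x_i @ W` rather than \(W x_i\)); the toy sizes and random inputs are made up for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4           # toy sizes: sequence length, input dim, head dim
X = rng.normal(size=(n, d_model))   # the x_i stacked as rows
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # q_i, k_i, v_i for every token

i = 2                                               # attend from position i
e_i = np.array([Q[i] @ K[j] for j in range(n)])     # e_ij = q_i . k_j
a_i = softmax(e_i)                                  # a_ij = softmax_j(e_ij)
O_i = a_i @ V                                       # O_i = sum_j a_ij v_j
```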

Vectorized:

\begin{equation} \begin{cases} Q = X W^{(Q)}\\ K = X W^{(K)}\\ V = X W^{(V)}\\ \end{cases} \end{equation}

where \(X\) stacks the \(x_{i}\) as rows (so each row of \(Q\) is \(q_{i}^{\top}\)), and then apply scaled dot-product attention:

\begin{equation} \text{Out} = \text{softmax} \left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right) V \end{equation}

Why divide by \(\sqrt{d_{k}}\)? See Tricks in Training.
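The same computation vectorized, as a rough sketch; the function name and shapes are mine, not from a particular library:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)          # stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for all positions at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) matrix of scaled e_ij
    return softmax(scores, axis=-1) @ V              # (n, d_v) outputs
```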

### Transformer Block

Self-attention by itself is just a weighted (rolling) average of the value vectors, with no nonlinearity. To introduce nonlinearity, we apply a linear layer followed by a ReLU after each attention layer.
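A sketch of that idea, assuming the simple attention-then-feed-forward structure described above (the original Transformer block uses two linear layers around the ReLU, plus the residual connections and layer norm from Tricks in Training, all omitted here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def transformer_block(X, W_Q, W_K, W_V, W_ff, b_ff):
    """Self-attention followed by a position-wise linear layer + ReLU."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)             # attention weights
    H = A @ V                                         # still just a weighted average of values
    return relu(H @ W_ff + b_ff)                      # the nonlinearity comes from here
```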

### Tricks in Training

- skip connections: \(x_{l} = F(x_{l-1})+x_{l-1}\), so gradients can also flow directly through the identity path (see the sketch after this list)
- layer norm: normalize each layer's activations to mean zero and standard deviation one, \(x^{(l')} = \frac{x^{(l)} - \mu^{(l)}}{\sigma^{(l)} + \epsilon}\), which protects higher layers against distribution shift in lower layers; \(\mu^{(l)}\) and \(\sigma^{(l)}\) are the **population mean and population standard deviation**
- this also explains the \(\sqrt{d_{k}}\) in attention: if the entries of \(q\) and \(k\) have mean \(0\) and variance \(1\), then \(q \cdot k = \sum_{m=1}^{d_{k}} q_{m} k_{m}\) still has mean \(0\) (the mean of a sum is the sum of means) but variance \(d_{k}\) (the variance of a sum of independent terms is the sum of variances), so we divide the scores by \(\sqrt{d_{k}}\) to bring the variance back to \(1\)
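A minimal sketch of the two tricks in this list; in practice layer norm also has learned gain and bias parameters, which are left out here, and the epsilon is illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to mean 0, std (approximately) 1."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual(x, f):
    """Skip connection: x_l = F(x_{l-1}) + x_{l-1}."""
    return f(x) + x
```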

### Word Order

#### Sinusoidal Position Embeddings

Rarely used in modern practice; absolute position doesn't matter much. See Relative Position Embeddings.

#### Relative Position Embeddings

Relative positions are LEARNED: we learn an embedding for each relative offset and add it into the self-attention computation, so the model captures relative rather than absolute order.
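One common way to realize this (a sketch of a T5-style learned relative bias, not necessarily the exact variant meant here): learn one parameter per clipped relative offset and add it to the attention scores before the softmax.

```python
import numpy as np

def relative_position_bias(n, max_dist, table):
    """Look up a learned bias for each relative offset j - i, clipped to +/- max_dist.
    `table` has shape (2 * max_dist + 1,) and would be a learned parameter."""
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]    # j - i for every (i, j)
    offsets = np.clip(offsets, -max_dist, max_dist) + max_dist # shift to valid indices
    return table[offsets]                                      # (n, n) bias matrix

# Used as: scores = Q @ K.T / np.sqrt(d_k) + relative_position_bias(n, max_dist, table)
```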

### Multi-Head Attention

Perform attention multiple times in parallel, producing one self-attention output per head, and concatenate the results. Each head's dimensionality is the full model dimension divided by the number of heads, so the total amount of computation stays the same.
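A sketch of multi-head attention under that convention: the model dimension is split into `n_heads` slices, each head runs attention over its own slice, and the concatenated head outputs are mixed by an output projection `W_O` (toy shapes, no batching or masking):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads                      # per-head dim: d_model / n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each (n, d_model)
    outs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)     # this head's slice of the features
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        outs.append(softmax(scores) @ V[:, sl])      # (n, d_head) output per head
    return np.concatenate(outs, axis=-1) @ W_O       # concatenate heads, then project
```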

## Transformer Drawbacks

- quadratic compute of self-attention: computing interactions between all pairs of positions means the computation grows **quadratically** with sequence length; Linformer attempts to address this