Transformers
Motivation
Lower Sequence-Length Time Complexity
Minimize Linear Interaction Distance
In a recurrent model, the interaction distance between two words scales as \(O(l)\) in the sequence length \(l\). Gradients have to travel across that linear distance, and linear order is baked in.
Maximize Parallelization
Recurrent forward and backward passes require waiting for the computation to roll from left to right; with attention, you can instead compute every position in parallel.
Key Advantage
- Maximum interaction distance is \(O(1)\): each word is connected to every other word
- The number of unparallelizable operations does not grow with sequence length
Self-Attention
Self-attention is formulated as each word in a sequence attending to each word in the same sequence.
Calculating QKV
\begin{equation} \begin{cases} q_{i} = W^{(Q)} x_{i}\\ k_{i} = W^{(K)} x_{i}\\ v_{i} = W^{(V)} x_{i}\\ \end{cases} \end{equation}
and then you have a standard good time using reduced-rank multiplicative attention:
\begin{equation} e_{ij} = q_{i} \cdot k_{j} \end{equation}
and normalize:
\begin{equation} a_{ij} = \text{softmax}_{j} (e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})} \end{equation}
to obtain:
\begin{equation} O_{i} = \sum_{j}^{} a_{ij} v_{j} \end{equation}
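Vectorized, stacking the \(x_{i}^{\top}\) as the rows of \(X\) (so the rows of \(Q\), \(K\), \(V\) are the \(q_{i}\), \(k_{i}\), \(v_{i}\)):
\begin{equation} \begin{cases} Q = X W^{(Q)\top}\\ K = X W^{(K)\top}\\ V = X W^{(V)\top}\\ \end{cases} \end{equation}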
and then apply scaled dot-product attention:
\begin{equation} Out = \text{softmax} \qty(\frac{QK^{\top}}{\sqrt{d_{k}}}) V \end{equation}
Why divide by \(\sqrt{d_{k}}\)? See Tricks in Training.
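The scaled dot-product formula above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the helper names (`matmul`, `transpose`, `softmax`, `attention`) and the toy numbers are made up here, and each row of `Q`, `K`, `V` is taken to be one token's vector.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with one token per row."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))  # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy example (made-up numbers): 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so each output row is a weighted average of the rows of `V`.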
Transformer Block
On its own, Self-Attention is simply a weighted (rolling) average of the value vectors, with no elementwise nonlinearity. To introduce nonlinearity, we apply a linear layer followed by a ReLU to each output.
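As a rough sketch of that step, assuming hypothetical weights `W` and bias `b` (the names and toy numbers are illustrative, not from any particular model), the same linear layer and ReLU are applied to every token's attention output independently:

```python
def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    """Compute W v + b for a weight matrix W (list of rows) and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def feed_forward(W, b, outputs):
    """Apply the same linear + ReLU to every token's attention output."""
    return [relu(linear(W, b, o)) for o in outputs]

# Toy weights (made-up numbers): 2 inputs -> 2 hidden units.
W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -1.0]
attn_out = [[2.0, 1.0], [0.0, 3.0]]
hidden = feed_forward(W, b, attn_out)
```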
Tricks in Training
- skip connections \(x_{l} = F(x_{l-1})+x_{l-1}\)
- layernorm: normalize each layer to mean zero and standard deviation one, which protects against distribution shifts in the lower layers: \(x^{(l')} = \frac{x^{(l)}- \mu^{(l)}}{\sigma^{(l)} + \epsilon}\), where \(\mu^{(l)}\) and \(\sigma^{(l)}\) are the population mean and population standard deviation
- scaled dot product: the mean of a sum is the sum of the means, so a dot product of mean-zero inputs still has mean \(0\), which is good. Yet the variance of a sum of independent terms is the sum of the variances, so for dimension \(d_{k}\) a dot product of unit-variance terms has variance \(d_{k}\). So we divide the attention scores by \(\sqrt{d_{k}}\), which brings the variance back to one
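The skip connection and layernorm from the list above can be sketched as follows. This is a minimal illustration that follows the formula as written in these notes; the function names are assumptions, and the learnable gain and bias that practical layernorm implementations usually add are left out.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to mean 0 and population standard deviation ~1."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / (sigma + eps) for v in x]

def with_skip(F, x):
    """Skip connection: F(x) + x, added elementwise."""
    return [f + v for f, v in zip(F(x), x)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
y = with_skip(layer_norm, x)  # F is just layer_norm here, purely for illustration
```

In a real block, `F` would be the attention or feed-forward sublayer rather than `layer_norm` itself.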
 
Word Order
Sinusoidal Position Embeddings
No one uses it lol. ABSOLUTE position doesn’t really matter. See Relative Position Embeddings.
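For reference, here is a minimal sketch of sinusoidal position embeddings, assuming the commonly used formulation with base 10000 in which even dimensions use sine and odd dimensions use cosine. The function name and arguments are illustrative.

```python
import math

def sinusoidal_positions(length, d_model, base=10000.0):
    """Return a length x d_model table of position embeddings (d_model even)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(0, d_model, 2):
            # Each sin/cos pair shares one frequency; frequency falls as i grows.
            angle = pos / (base ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_positions(length=4, d_model=6)
```

Each row is added to the input embedding at that position, so the signal depends only on the absolute index.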
Relative Position Embeddings
Relative positions are LEARNED and added to the self-attention outputs, so we learn an embedding for each relative offset.
Multi-Head Attention
Perform attention multiple times, get a series of SA embeddings, and concatenate them. For each single head, divide the dimension by the number of heads (so you end up doing the same amount of computation).
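A minimal sketch of this idea, under some simplifying assumptions: `Q`, `K`, `V` are already projected, the feature dimension divides evenly by the number of heads `h`, and the final output projection that usually follows the concatenation is omitted. All function names here are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention with one token per row."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def split_heads(X, h):
    """Split each row of X into h equal slices; returns one matrix per head."""
    d_head = len(X[0]) // h
    return [[row[i * d_head:(i + 1) * d_head] for row in X] for i in range(h)]

def multi_head_attention(Q, K, V, h):
    """Run attention per head on d/h-dimensional slices, then concatenate."""
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, h), split_heads(K, h), split_heads(V, h))]
    # Concatenate the head outputs token by token.
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]

# Toy example (made-up numbers): 2 tokens, d = 4, h = 2 heads of size 2.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(X, X, X, h=2)
```

Because each head works on a slice of size \(d/h\), the total work across heads stays roughly the same as one full-width attention.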
Transformer Drawbacks
- quadratic compute in self-attention: computing all pairs of interactions means the computation grows quadratically with sequence length. Linformer attempts to solve this
