## Evaluating Machine Translation

### BLEU

BLEU compares a machine translation against one or more human reference translations. It scores via a geometric mean of n-gram precisions; the exact maximum n-gram size isn't particularly important.

The original idea was to use **multiple reference translations**, but in practice people often use only a single reference translation; the score is still informative **in expectation**.
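
A minimal sketch of the computation, assuming a single tokenized hypothesis and reference (names are illustrative; real implementations add smoothing and corpus-level aggregation):

```python
# Toy BLEU: geometric mean of clipped n-gram precisions times a brevity penalty.
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    # Modified n-gram precisions: clip hypothesis counts by reference counts.
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        total = max(sum(hyp.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(),
           "the cat is on the mat".split(), max_n=2))  # ~0.71
```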

#### Limitations

- a good translation can get a bad BLEU score simply because it has low n-gram overlap with the references
- there is a brevity penalty for too-short system translations (so translating only the easy sentences doesn't score well)
- you realistically can't reach 100 BLEU because of natural variation in how text can be translated

## attention

Given a set of vector **values** and a vector **query**, attention is a technique for computing a weighted sum of the values, with the weights depending on the query.

### motivation

In machine translation, a naive seq2seq LSTM has to cram all of the information about the source sentence into a single final encoder vector.

- improves performance
- gives a more human-like model of the MT process
- solves the bottleneck problem
- helps with vanishing gradients (shorter paths for the gradient to flow through)
- interpretability: provides soft phrase-level alignments, so we can see what the model is translating

### implementation

**at each step of the decoder, we insert direct connections to the encoder so the decoder can look at particular parts of the source input sequence**

dot the decoder state at each step against every encoder state, softmax the resulting scores, and use them to take a weighted sum over the encoder (source) states.
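
A rough NumPy sketch of that step at a single decoder position, using plain dot-product scores (defined formally below); shapes and names are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # encoder_states: (src_len, d), decoder_state: (d,)
    scores = encoder_states @ decoder_state   # e_i = s^T h_i for each source position i
    weights = softmax(scores)                 # attention distribution over the source
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# Toy example: 5 source positions, hidden size 8.
h = np.random.randn(5, 8)
s = np.random.randn(8)
context, weights = attend(s, h)
print(weights.sum())  # ~1.0
```

The context vector is then typically combined (e.g. concatenated) with the decoder state to predict the next token.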

with encoder hidden states \(h_{i}\) and decoder state \(s\):

#### dot product attention

\begin{align} e_{i} = s^{T} h_{i} \end{align}

**limitation**: LSTM hidden states carry a lot of information at once, and not all of it is useful for deciding where to attend; the plain dot product also forces the encoder and decoder states to have exactly matching dimensions.

#### multiplicative attention

“learn a map from encoder vectors to decoder vectors—working out the right place to pay attention by learning it”

\begin{equation} e_{i} = s^{T} W h_{i} \end{equation}

**limitation**: lots of parameters to learn in \(W\) for no good reason
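
A small NumPy sketch of the multiplicative score, with `W` standing in for the learned matrix (random placeholder here):

```python
import numpy as np

d_dec, d_enc, src_len = 6, 8, 5
W = np.random.randn(d_dec, d_enc)      # learned in practice
s = np.random.randn(d_dec)             # decoder state
h = np.random.randn(src_len, d_enc)    # encoder states

scores = h @ W.T @ s                   # e_i = s^T W h_i for every source position i
print(scores.shape)                    # (5,)
```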

#### reduced-rank multiplicative attention

\begin{equation} e_{i} = s^{T} Q^{T} R h_{i} = (Q s)^{T} (R h_{i}) \end{equation}

essentially, why don’t we project \(s\) and \(h\) down to smaller dimensions before the dot product is taken?
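
A sketch of the same idea with assumed low-rank projections `Q` and `R` (again random placeholders for learned maps):

```python
import numpy as np

k, d_dec, d_enc, src_len = 3, 6, 8, 5
Q = np.random.randn(k, d_dec)          # projects decoder state down to dimension k
R = np.random.randn(k, d_enc)          # projects encoder states down to dimension k
s = np.random.randn(d_dec)
h = np.random.randn(src_len, d_enc)

scores = (h @ R.T) @ (Q @ s)           # e_i = (Q s)^T (R h_i)
print(scores.shape)                    # (5,)
```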

this also motivates the attention used in transformers

#### additive attention

\begin{equation} e_{i} = v^{T} \tanh\left(W_{1} h_{i} + W_{2} s\right) \end{equation}

where \(v\), \(W_{1}\), and \(W_{2}\) are learned parameters.
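
A sketch of the additive score, where `v`, `W1`, and `W2` stand in for the learned parameters (random placeholders here):

```python
import numpy as np

d_att, d_dec, d_enc, src_len = 4, 6, 8, 5
W1 = np.random.randn(d_att, d_enc)
W2 = np.random.randn(d_att, d_dec)
v = np.random.randn(d_att)
s = np.random.randn(d_dec)
h = np.random.randn(src_len, d_enc)

scores = np.tanh(h @ W1.T + W2 @ s) @ v   # e_i = v^T tanh(W1 h_i + W2 s)
print(scores.shape)                        # (5,)
```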