Houjun Liu

SU-CS224N APR232024

Evaluating Machine Translation

BLEU

Compare the machine translation against multiple human reference translations. The score is a geometric mean of n-gram precisions; the exact n-gram size isn't particularly important.

The original idea was to use multiple reference translations, but in practice people often score against only one reference translation; averaged over many sentences, this still gives a reasonable score in expectation.

Limitations

  • a good translation can get a bad BLEU score because it has low n-gram overlap with the references
  • there is a penalty for too-short system translations (so translating only the easy sentences doesn't score well; see the sketch below)
  • you realistically can't get to 100 BLEU because of natural variation among valid translations
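As a rough sketch of how the score comes together (a toy version with no smoothing; the function name `bleu` and the reference-length tie-breaking are assumptions, not the reference implementation):

```python
from collections import Counter
import math

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of modified n-gram precisions,
    times a brevity penalty. Inputs are lists of tokens."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        # clip each candidate n-gram count by its max count in any reference
        max_ref = Counter()
        for ref in references:
            ref_counts = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for gram, count in ref_counts.items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(clipped / total) if clipped else float("-inf"))
    # brevity penalty: punish candidates shorter than the closest reference
    closest_ref = min((len(r) for r in references),
                      key=lambda l: (abs(l - len(candidate)), l))
    bp = 1.0 if len(candidate) > closest_ref else \
        math.exp(1 - closest_ref / max(len(candidate), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```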

attention

Given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, where the weights depend on the query.

motivation

In the machine translation problem, a naive LSTM seq2seq implementation has to squeeze all of the information about the source sentence into a single final encoder vector.

  • improves performance
  • a more human-like model of the MT process (translate while looking back at the source, rather than memorizing it)
  • solves the bottleneck problem
  • helps with Vanishing Gradients (the attention connections act as shortcuts to distant states)
  • interpretability: provides soft phrase-level alignments, so we can see what is being translated

implementation

at each step of the decoder, we insert direct connections to the encoder so that the decoder can look at particular parts of the input source sequence

dot each decoder state against every encoder state, softmax the resulting scores, and use them to take a weighted sum over the encoder states.
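Concretely, a minimal numpy sketch of one such decoder step, using the dot-product score defined below (the names `attention_step`, `s`, and `H` are illustrative, not from the lecture):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a vector of scores
    x = x - x.max()
    exps = np.exp(x)
    return exps / exps.sum()

def attention_step(s, H):
    """s: decoder hidden state, shape (d,)
    H: encoder hidden states, shape (T, d), one row per source token."""
    scores = H @ s           # e_i = s^T h_i for every source position i
    alpha = softmax(scores)  # attention distribution over source positions
    output = alpha @ H       # weighted sum of encoder states, shape (d,)
    return alpha, output
```

The attention output is then typically concatenated with the decoder state \(s\) before predicting the next word.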

with encoder hidden states \(h_{i}\) and the current decoder hidden state \(s\):

dot product attention

\begin{equation} e_{i} = s^{T} h_{i} \end{equation}

limitation: the LSTM hidden states are a bit overloaded (not every dimension is equally useful for attention), and the plain dot product forces the encoder and decoder states to have matching dimensions

multiplicative attention

“learn a map from encoder vectors to decoder vectors—working out the right place to pay attention by learning it”

\begin{equation} e_{i} = s^{T} W h_{i} \end{equation}

limitation: lots of parameters to learn in \(W\) (one per pair of encoder/decoder dimensions) for no good reason
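Continuing the numpy sketch above, the multiplicative score for all source positions at once (assuming \(s\) has dimension \(d_{dec}\), the rows of \(H\) have dimension \(d_{enc}\), and \(W\) is \(d_{dec} \times d_{enc}\)):

```python
def multiplicative_scores(s, H, W):
    # e_i = s^T W h_i for every row h_i of H
    # s: (d_dec,), H: (T, d_enc), W: (d_dec, d_enc) numpy arrays
    return H @ (W.T @ s)
```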

reduced-rank multiplicative attention

\begin{equation} e_{i} = s^{T} Q^{T} R h_{i} = (Q s)^{T} (R h_{i}) \end{equation}

essentially, why don’t we project \(s\) and \(h\) down to smaller dimensions before the dot product is taken?

this also motivates transformers
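As a sketch, the reduced-rank score (assuming \(Q\) is \(k \times d_{dec}\) and \(R\) is \(k \times d_{enc}\) for some small \(k\)):

```python
def reduced_rank_scores(s, H, Q, R):
    # e_i = (Q s)^T (R h_i): project both sides down to dimension k,
    # then take a plain dot product
    # s: (d_dec,), H: (T, d_enc), Q: (k, d_dec), R: (k, d_enc)
    return (H @ R.T) @ (Q @ s)
```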

additive attention

\begin{equation} e_{i} = v^{T} \tanh \qty(W_{1} h_{i} + W_{2} s) \end{equation}

where \(v\) and \(W_{j}\) are learned.
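A sketch of the additive score (assuming \(W_{1}\) is \(d_{a} \times d_{enc}\), \(W_{2}\) is \(d_{a} \times d_{dec}\), and \(v\) has length \(d_{a}\)):

```python
import numpy as np

def additive_scores(s, H, W1, W2, v):
    # e_i = v^T tanh(W1 h_i + W2 s) for every row h_i of H
    # s: (d_dec,), H: (T, d_enc), W1: (d_a, d_enc), W2: (d_a, d_dec), v: (d_a,)
    return np.tanh(H @ W1.T + W2 @ s) @ v
```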