Benchmark tradeoffs
- baseline too high: no one can beat it, so the benchmark can't show progress
- baseline too low: everyone clears it, so there's no differentiation between systems
Close-ended evaluation
- use standard ML metrics ("accuracy"); a minimal sketch follows this list
- because each example has one of a few known correct answers
- example tasks: sentiment classification (SST, IMDB, Yelp); natural language inference (SNLI)
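A minimal sketch of close-ended scoring as plain accuracy; `predictions` and `gold_labels` are hypothetical lists of class labels, not tied to any particular dataset:

```python
# Close-ended evaluation reduces to standard ML metrics such as accuracy.
def accuracy(predictions, gold_labels):
    """Fraction of examples where the predicted label matches the gold label."""
    assert len(predictions) == len(gold_labels)
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Hypothetical sentiment predictions vs. gold labels.
print(accuracy(["pos", "neg", "pos"], ["pos", "neg", "neg"]))  # 0.666...
```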
Most common multi-task benchmark: SuperGLUE
Difficulties
- what metrics do you choose?
- how do you aggregate across metrics (a simple average? see the sketch below)
- label statistics (e.g., class imbalance a model can exploit)
- spurious correlations between surface features and labels
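One common but crude aggregation is an unweighted average of per-task scores (roughly what SuperGLUE-style leaderboards report); the task scores below are made up for illustration, and each metric is assumed to already live on a comparable 0-100 scale:

```python
# Unweighted macro-average over per-task scores (illustrative numbers only).
task_scores = {"BoolQ": 80.2, "CB": 90.1, "COPA": 85.0, "RTE": 78.3}

benchmark_score = sum(task_scores.values()) / len(task_scores)
print(round(benchmark_score, 1))  # 83.4 -- a single number that hides per-task variance
```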
Open-ended evaluations
- long generations with many plausible correct answers (classic ML metrics can't be applied directly)
- answers are relatively better or worse, not simply right or wrong
Content Overlap Metrics
- compare lexical similarity between generated and gold text
- usually n-gram overlap metrics (a minimal sketch follows this list)
- e.g., BLEU (usually considered a precision-style metric), ROUGE (usually considered a recall-style metric), METEOR, CIDEr
- doesn’t consider semantic relatedness
- but is fast!
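A stripped-down sketch of n-gram overlap, assuming pre-tokenized text; real BLEU adds a brevity penalty and a geometric mean over several n, and ROUGE has several variants, so this only shows the core precision/recall idea:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped precision: overlapping n-grams / candidate n-grams."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def ngram_recall(candidate, reference, n=1):
    """ROUGE-style recall: overlapping n-grams / reference n-grams."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

gen = "the cat sat on the mat".split()
gold = "the cat is on the mat".split()
print(ngram_precision(gen, gold), ngram_recall(gen, gold))  # 0.833..., 0.833...
```

Note that a perfectly good paraphrase sharing no n-grams with the gold text scores zero here, which is exactly the semantic blind spot noted above.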
Semantic metrics
- BERTScore: compute contextual token embeddings with BERT, then greedily match candidate and reference tokens by cosine similarity and average the matched scores
- Word embeddings: average the (static) word embeddings of each sequence and compare the resulting vectors; a sketch follows this list
- BLEURT: start from a pretrained BERT, continue pretraining on synthetic data scored with automatic metrics (e.g., BLEU), then fine-tune on human annotation data
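A sketch of the word-embedding-averaging baseline; `embedding_table` is an assumed pretrained token-to-vector lookup (e.g., GloVe-style toy vectors), whereas BERTScore swaps in contextual BERT embeddings and greedy token matching instead of plain averaging:

```python
import numpy as np

# Assumed pretrained static embeddings (toy 2-d vectors for illustration).
embedding_table = {
    "the": np.array([0.1, 0.3]), "cat": np.array([0.9, 0.2]),
    "dog": np.array([0.8, 0.3]), "sat": np.array([0.4, 0.8]),
}

def sentence_vector(tokens):
    """Average the embeddings of the tokens we have vectors for."""
    vectors = [embedding_table[t] for t in tokens if t in embedding_table]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sequences with semantically related words get similar averaged vectors.
print(cosine(sentence_vector("the cat sat".split()),
             sentence_vector("the dog sat".split())))
```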
Model Based Metrics
AlpacaEval and MT-Bench: ask GPT-4 to score a particular sample (LLM-as-judge); a sketch follows.
- worries about self-bias (the judge may favor outputs that resemble its own)
- length bias, hence length normalization / length-controlled scoring
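A hedged sketch of the LLM-as-judge pattern; `call_judge_model` is a hypothetical wrapper around whatever API you use, and the prompt and 1-10 scale are illustrative rather than the exact templates AlpacaEval or MT-Bench use:

```python
# Generic LLM-as-judge scoring loop (illustrative prompt, hypothetical API wrapper).
JUDGE_PROMPT = """You are grading a model response.
Question: {question}
Response: {response}
Rate the response from 1 (poor) to 10 (excellent). Reply with only the number."""

def judge_score(question, response, call_judge_model):
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    reply = call_judge_model(prompt)   # e.g., one GPT-4 chat completion
    return int(reply.strip())          # parse the numeric rating

# Caveats from the bullets above: the judge may prefer its own style (self-bias)
# and longer answers, so scores are often length-controlled or turned into
# pairwise comparisons against a fixed reference model.
```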
Humans!
automatic evaluations need to be compared against human judgments: ask humans to evaluate outputs along some axis ("fluency", "coherence", etc.)
- slow
- expensive
- inter-annotator disagreement (a Cohen's kappa sketch follows this list)
- intra-annotator disagreement (the same annotator at different times)
- not reproducible
- measures precision, not recall (humans judge only the outputs the model actually produced)
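Inter-annotator disagreement is usually quantified with a chance-corrected agreement statistic; here is a minimal Cohen's kappa sketch for two annotators, using made-up labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating the same outputs as "fluent" / "not" (made-up data).
a = ["fluent", "fluent", "not", "fluent", "not"]
b = ["fluent", "not",    "not", "fluent", "not"]
print(cohens_kappa(a, b))  # ~0.615: substantial but imperfect agreement
```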