Model Evaluation

Some ideas on model validation

Cross Validation

Hold-out cross-validation

For instance, you can do:

  • 70% for training
  • 30% held out for testing

But at very large dataset scales, the hold-out set can be capped at a fixed absolute size rather than a fixed fraction: holding out only 0.1% of the data, for example, can still leave around 10k validation samples.
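
A minimal sketch of a 70/30 hold-out split, assuming scikit-learn is available and that X and y are placeholder arrays standing in for a real dataset:

  # Hold-out split sketch (X, y are synthetic placeholders for a real dataset).
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  # 70% for training, 30% held out; shuffle before splitting.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, shuffle=True, random_state=0
  )

  model = LogisticRegression().fit(X_train, y_train)
  print("hold-out accuracy:", model.score(X_test, y_test))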

k-fold cross validation

  1. shuffle the data
  2. divide the data into \(k\) equal-sized pieces
  3. repeatedly train the algorithm on \(k - 1\) of the pieces and test on the remaining one, so that each piece serves as the test set exactly once
  4. average the \(k\) test scores

In practice, people typically use \(k = 10\) folds.
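
A minimal k-fold sketch with scikit-learn's KFold, again using placeholder data (the KFold iterator handles the shuffle-and-split steps above):

  # k-fold cross-validation sketch (k = 10), using synthetic placeholder data.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import KFold

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  kf = KFold(n_splits=10, shuffle=True, random_state=0)   # steps 1-2: shuffle, split into k pieces
  scores = []
  for train_idx, test_idx in kf.split(X):
      # step 3: train on k - 1 pieces, test on the held-out piece
      model = LogisticRegression().fit(X[train_idx], y[train_idx])
      scores.append(model.score(X[test_idx], y[test_idx]))

  print("mean 10-fold accuracy:", np.mean(scores))        # step 4: average the scores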

LOOCV

See Leave-One-Out Cross Validation

Test Set

In academic settings (less so in production), we can additionally hold out a third split, the test set, and report results on it as an unbiased estimate of performance, since it was never used for training or model selection.
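
A sketch of the resulting three-way split (the 60/20/20 ratio here is just an illustrative assumption):

  # Three-way split sketch: train / validation / test (60/20/20).
  # The test set is touched only once, at the very end, to report the unbiased estimate.
  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
  # ...tune and select models on (X_val, y_val), then report the final score on (X_test, y_test)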

LLM Evaluation Types

Intrinsic Evaluation

Intrinsic Evaluation, also known as In-Vitro Evaluation, focuses on evaluating a language model's performance at, well, language modeling itself.

Typically, we use perplexity.

  • directly measures language model performance
  • doesn't necessarily correlate with performance on real applications
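
For a held-out sequence of \(N\) tokens \(w_1, \dots, w_N\), perplexity is the exponentiated average negative log-likelihood of each token given its preceding context (lower is better):

\[ \text{perplexity}(w_1, \dots, w_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1}) \right) \]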

Extrinsic Evaluation

Extrinsic Evaluation, also known as In-Vivo Evaluation, focuses on comparing two language models by how well they perform on a downstream test task.
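
A hedged sketch of what this might look like: both models are scored on the same downstream task (a toy sentiment benchmark here), and the comparison is based on task accuracy rather than perplexity. The classify_a / classify_b wrappers and the benchmark examples are hypothetical placeholders, not a real API or dataset.

  # Extrinsic (in-vivo) evaluation sketch: compare two models on a downstream task.
  # classify_a / classify_b are hypothetical wrappers around each language model.

  benchmark = [
      ("the movie was wonderful", "positive"),     # toy labeled examples
      ("a complete waste of time", "negative"),
  ]

  def task_accuracy(classify, dataset):
      """Fraction of examples the model labels correctly on the downstream task."""
      correct = sum(1 for text, label in dataset if classify(text) == label)
      return correct / len(dataset)

  # acc_a = task_accuracy(classify_a, benchmark)
  # acc_b = task_accuracy(classify_b, benchmark)
  # The model with the higher task accuracy wins the extrinsic comparison.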