Model Evaluation

Some ideas on model validation

Cross Validation

Hold-out cross-validation

For instance, you can do:

  • 70% for training
  • 30% held out for testing

But at very large dataset scales, the hold-out set can be capped at a fixed absolute size rather than a fixed fraction: holding out only 0.1% of the data, for example, can still leave around 10k validation samples.
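
A minimal sketch of a 70/30 hold-out split, assuming scikit-learn is available and that X and y are placeholder arrays standing in for a real dataset:

  # Hold-out split sketch (X, y are synthetic placeholders for a real dataset).
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  # 70% for training, 30% held out; shuffle before splitting.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, shuffle=True, random_state=0
  )

  model = LogisticRegression().fit(X_train, y_train)
  print("hold-out accuracy:", model.score(X_test, y_test))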

k-fold cross validation

  1. shuffle the data
  2. divide the data into \(k\) equal-sized pieces
  3. repeatedly train the algorithm on \(k - 1\) of the pieces and test on the remaining one, so that each piece serves as the test set exactly once
  4. average the \(k\) test scores

In practice, people typically use \(k = 10\) folds.
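
A minimal k-fold sketch with scikit-learn's KFold, again using placeholder data (the KFold iterator handles the shuffle-and-split steps above):

  # k-fold cross-validation sketch (k = 10), using synthetic placeholder data.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import KFold

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  kf = KFold(n_splits=10, shuffle=True, random_state=0)   # steps 1-2: shuffle, split into k pieces
  scores = []
  for train_idx, test_idx in kf.split(X):
      # step 3: train on k - 1 pieces, test on the held-out piece
      model = LogisticRegression().fit(X[train_idx], y[train_idx])
      scores.append(model.score(X[test_idx], y[test_idx]))

  print("mean 10-fold accuracy:", np.mean(scores))        # step 4: average the scores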

LOOCV

See Leave-One-Out Cross Validation

Test Set

In academic settings (less so in production), we can additionally hold out a third split, the test set, and report results on it as an unbiased estimate of performance, since it was never used for training or model selection.
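
A sketch of the resulting three-way split (the 60/20/20 ratio here is just an illustrative assumption):

  # Three-way split sketch: train / validation / test (60/20/20).
  # The test set is touched only once, at the very end, to report the unbiased estimate.
  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.random.rand(1000, 5)              # placeholder features
  y = (X[:, 0] > 0.5).astype(int)          # placeholder labels

  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
  # ...tune and select models on (X_val, y_val), then report the final score on (X_test, y_test)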

LLM Evaluation Types

Intrinsic Evaluation

Intrinsic Evaluation, also known as In-Vitro Evaluation, focuses on evaluating a language model's performance at, well, language modeling itself.

Typically, we use perplexity.

  • directly measures language model performance
  • doesn't necessarily correlate with performance on real applications
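
For a held-out sequence of \(N\) tokens \(w_1, \dots, w_N\), perplexity is the exponentiated average negative log-likelihood of each token given its preceding context (lower is better):

\[ \text{perplexity}(w_1, \dots, w_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1}) \right) \]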

Extrinsic Evaluation

Extrinsic Evaluation, also known as In-Vivo Evaluation, focuses on comparing two language models by how well they perform on a downstream test task.
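
A hedged sketch of what this might look like: both models are scored on the same downstream task (a toy sentiment benchmark here), and the comparison is based on task accuracy rather than perplexity. The classify_a / classify_b wrappers and the benchmark examples are hypothetical placeholders, not a real API or dataset.

  # Extrinsic (in-vivo) evaluation sketch: compare two models on a downstream task.
  # classify_a / classify_b are hypothetical wrappers around each language model.

  benchmark = [
      ("the movie was wonderful", "positive"),     # toy labeled examples
      ("a complete waste of time", "negative"),
  ]

  def task_accuracy(classify, dataset):
      """Fraction of examples the model labels correctly on the downstream task."""
      correct = sum(1 for text, label in dataset if classify(text) == label)
      return correct / len(dataset)

  # acc_a = task_accuracy(classify_a, benchmark)
  # acc_b = task_accuracy(classify_b, benchmark)
  # The model with the higher task accuracy wins the extrinsic comparison.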