Some ideas for model validation
Cross Validation
Hold-out cross-validation
For instance, you can do:
- 70% for training
- 30% held out for validation/testing
At very large dataset scales, though, the hold-out set can be capped at a fixed size rather than a fixed fraction (holding out something like 0.1% of the data can still give you 10k samples).
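A minimal sketch of the hold-out split, assuming scikit-learn and NumPy are available (the data here is random and only shows the shape of the call):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # toy features
y = np.random.randint(0, 2, size=1000)   # toy labels

# 70% train / 30% hold-out, as in the split above
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# At very large scale, pass an integer to cap the hold-out at a fixed count instead:
# train_test_split(X, y, test_size=10_000)
```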
k-fold cross validation
- shuffle the data
- divide the data into \(k\) equal sized pieces
- repeatedly train the algorithm on \(k-1\) of the pieces and test on the remaining one (e.g. train on 4/5 of the data and test on 1/5 when \(k = 5\))
In practice, people typically use \(k = 10\) folds.
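A minimal 10-fold sketch, assuming scikit-learn; the logistic regression and random data are just placeholders for your model and dataset:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)  # shuffle, then split into k = 10 pieces
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 pieces
    scores.append(model.score(X[test_idx], y[test_idx]))  # test on the remaining piece

print(np.mean(scores))  # average accuracy across the k folds
```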
LOOCV
See Leave-One-Out Cross Validation
Test Set
In academic settings (not production), we can report the final result on a third, held-out split of the dataset, which gives an unbiased estimate of performance.
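One way to set this up is two successive splits, so the test set is never touched during model selection; a sketch assuming scikit-learn, with illustrative proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Carve off the test set first; it is only used for the final reported number.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)
```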
LLM Evaluation Types
Intrinsic Evaluation
In-Vitro Evaluation, or Intrinsic Evaluation, focuses on evaluating a language model's performance at, well, language modeling.
Typically, we use perplexity.
- directly measure language model performance
- doesn't necessarily correspond to performance on real applications
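Perplexity is the exponential of the average negative log-likelihood per token; a small sketch with made-up log-probabilities standing in for what the model actually assigns:

```python
import math

# Hypothetical natural-log probabilities the model assigned to each token in a held-out text.
token_log_probs = [-2.3, -0.7, -1.9, -0.2, -3.1]

avg_nll = -sum(token_log_probs) / len(token_log_probs)  # average negative log-likelihood
perplexity = math.exp(avg_nll)                          # lower is better
print(perplexity)
```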
Extrinsic Evaluation
Extrinsic Evaluation, also known as In-Vivo Evaluation, compares two language models by their performance on a downstream test task.
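For example, a downstream comparison might look like the sketch below, with placeholder predictions standing in for each model's outputs on the task's test set:

```python
gold        = ["pos", "neg", "pos", "pos", "neg"]   # task labels
model_a_out = ["pos", "neg", "neg", "pos", "neg"]   # hypothetical predictions from model A
model_b_out = ["pos", "pos", "pos", "pos", "neg"]   # hypothetical predictions from model B

def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

print("model A:", accuracy(model_a_out, gold))
print("model B:", accuracy(model_b_out, gold))
```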