- evaluation is quite hard—you need a principled way to estimate model ability
Classical Test Theory
- “just average each test” (think MUC, B³, etc.)
- test-dependent ability estimation
- BAD: because each test may have a different difficulty (see the sketch below)
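A minimal sketch of why plain averaging is test-dependent; the test names and scores are made up for illustration:

```python
# Classical Test Theory scoring: ability estimate = plain average of test scores.
# Hypothetical numbers: model A happened to draw a hard test, model B only easy ones.
scores_model_a = {"test_easy": 0.92, "test_hard": 0.55}
scores_model_b = {"test_easy": 0.90, "test_easy_2": 0.88}

def ctt_score(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)

print(ctt_score(scores_model_a))  # 0.735 -- dragged down by test difficulty
print(ctt_score(scores_model_b))  # 0.890 -- looks better only because its tests were easy
```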
Item Response Theory (IRT)
- model item and test-taker characteristics (sketch after this list)
- test-invariant ability estimation (subset invariant)
- adaptive testing
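A sketch of the standard IRT response model in its 2PL parameterization; the response model used below is the 1PL special case with discrimination \(a = 1\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_correct_2pl(theta, a, b):
    """2PL IRT model: P(correct) = sigma(a * (theta - b)),
    with test-taker ability theta, item difficulty b, item discrimination a."""
    return sigmoid(a * (theta - b))

# Same item, test takers of increasing ability:
for theta in [-1.0, 0.0, 1.0]:
    print(theta, p_correct_2pl(theta, a=1.5, b=0.5))
```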
problem
- requires calibration first
- …which is quite costly
Flash-HELM
Run HELM while prioritizing higher-ranked models: evaluate the better models on more examples.
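A hedged sketch of the rank-dependent budgeting idea; the halving schedule, model names, and numbers are illustrative assumptions, not the published Flash-HELM rule:

```python
def allocate_examples(ranked_models: list[str], max_examples: int, min_examples: int) -> dict[str, int]:
    """Give higher-ranked models more evaluation examples.
    The halving schedule is an illustrative assumption."""
    budgets, budget = {}, max_examples
    for model in ranked_models:  # ranked best -> worst
        budgets[model] = max(budget, min_examples)
        budget //= 2
    return budgets

# Hypothetical ranking and budget sizes:
print(allocate_examples(["model-a", "model-b", "model-c", "model-d"],
                        max_examples=8000, min_examples=500))
```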
Sang’s Method
We want to estimate \(\theta\) with a budget of \(K\) questions.
- test-taker ability is fixed but unknown: \(\theta \sim p(\theta)\)
- there's a difficulty function \(z : Q \to \triangle\) assigning each question \(q \in Q\) a difficulty \(z(q)\)
- our response model, then, is \(p(y=1 | z; \theta) = \sigma(\theta - z)\)
Then, for every remaining question, we ask what its Fisher information is at the current ability estimate, and pick the most informative one. After each answer, we re-estimate \(\theta\) in the response model by MLE (sketch below).
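A minimal sketch of the selection-and-update loop under the 1PL response model above; `answer_fn` is a hypothetical interface that returns 1 if the test taker answers question \(q\) correctly, 0 otherwise:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fisher_information(theta, z):
    # For p = sigma(theta - z), the Fisher information w.r.t. theta is p * (1 - p).
    p = sigmoid(theta - z)
    return p * (1.0 - p)

def mle_theta(zs, ys):
    """MLE of ability theta given asked difficulties zs and 0/1 outcomes ys.
    The bounded search keeps the estimate finite when all outcomes agree."""
    zs, ys = np.asarray(zs), np.asarray(ys)
    def neg_log_lik(theta):
        p = np.clip(sigmoid(theta - zs), 1e-9, 1 - 1e-9)
        return -np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded").x

def adaptive_test(question_difficulties, answer_fn, K, theta0=0.0):
    """Ask K questions: pick the question with maximal Fisher information at the
    current theta estimate, observe the outcome, then re-estimate theta by MLE."""
    remaining = list(range(len(question_difficulties)))
    asked_z, outcomes, theta = [], [], theta0
    for _ in range(K):
        q = max(remaining, key=lambda i: fisher_information(theta, question_difficulties[i]))
        remaining.remove(q)
        asked_z.append(question_difficulties[q])
        outcomes.append(answer_fn(q))  # 1 if correct, else 0
        theta = mle_theta(asked_z, outcomes)
    return theta
```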
amortized calibration
- predict the calibrated difficulty \(z\) with a learned model, instead of calibrating every question from scratch (see sketch below)
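The notes don't specify the amortized model; a minimal sketch, assuming difficulty is regressed from question embeddings (the feature choice and the `Ridge` regressor are assumptions):

```python
from sklearn.linear_model import Ridge

def fit_difficulty_predictor(calibrated_embeddings, calibrated_z):
    """Fit once on questions whose difficulty z was already IRT-calibrated.
    calibrated_embeddings: (n, d) question features; calibrated_z: (n,) difficulties."""
    return Ridge(alpha=1.0).fit(calibrated_embeddings, calibrated_z)

def predict_difficulty(model, new_embeddings):
    """Amortized step: predict z(q) for unseen questions without new calibration data."""
    return model.predict(new_embeddings)
```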
advantages
- more reliable and efficient across empirical settings
- incorporates amortized (learned) calibration to reduce calibration costs
- introduces conditional question generation to generate questions of specific difficulties