Houjun Liu

SU-CS120 OCT082024

  • evaluation is quite hard; you need …

Classical Test Theory

  • “just average each test” (think MUC, B³, etc.)
  • test-dependent ability estimation
  • BAD: each test may have a different difficulty

Item Response Theory (IRT)

  • model item and test-taker characteristics (e.g., the logistic model written out below)
  • test-invariant ability estimation (subset invariant)
  • adaptive testing
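
In symbols (a standard IRT formulation, added here for concreteness; the lecture's own response model appears under Sang's Method below), each item \(j\) gets a difficulty \(b_j\) and a discrimination \(a_j\), and each test taker an ability \(\theta_i\):

\[ p(y_{ij} = 1 \mid \theta_i) = \sigma\big(a_j(\theta_i - b_j)\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \]

Setting \(a_j = 1\) gives the one-parameter (Rasch) model, with \(z\) below playing the role of \(b_j\).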

problem

  • requires calibration first
  • …which is quite costly

Flash-HELM

Run HELM while prioritizing higher-ranked models: evaluate the better models more thoroughly.

Sang’s Method

We want to estimate \(\theta\) with a budget of \(K\) questions.

  • test taker ability is fixed, but unknown: \(\theta \sim p(\theta)\)
  • there’s some difficulty function \(z(q) \to Z \in \triangle\) for each question \(q \in Q\)
  • our response model, then, is \(p(y=1 | z; \theta) = \sigma(\theta - z)\)

Then, for every question in our pool, we ask what its Fisher information is, and pick the one that is most informative at the current estimate of \(\theta\) to ask next.
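
For this response model, with \(p = \sigma(\theta - z)\), the Fisher information of a single question has the standard closed form (written out here for reference)

\[ I(\theta; z) = p\,(1 - p), \qquad p = \sigma(\theta - z), \]

which is maximized at \(p = \tfrac{1}{2}\), i.e. for questions whose difficulty sits near the current ability estimate.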

After every test result, you update the ability estimate \(\theta\) in the response model using MLE.
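
A minimal sketch in Python of how this select-then-update loop could look (the question pool, its calibrated difficulties, and the ask(i) callback are hypothetical placeholders, not the lecture's actual setup):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fisher_information(theta, z):
        # For p(y=1 | z; theta) = sigmoid(theta - z), the information is p * (1 - p).
        p = sigmoid(theta - z)
        return p * (1.0 - p)

    def mle_theta(answered_z, answered_y, grid=np.linspace(-4, 4, 801)):
        # Maximum-likelihood estimate of theta over a 1D grid (simple and robust).
        z = np.asarray(answered_z)
        y = np.asarray(answered_y)
        p = sigmoid(grid[:, None] - z[None, :])        # shape: (grid points, questions)
        loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
        return grid[np.argmax(loglik)]

    def adaptive_test(ask, difficulties, K, theta0=0.0):
        # ask(i) returns the 0/1 outcome of question i; difficulties are calibrated z values.
        theta = theta0
        remaining = set(range(len(difficulties)))
        zs, ys = [], []
        for _ in range(K):
            # Pick the unasked question that is most informative at the current theta.
            i = max(remaining, key=lambda j: fisher_information(theta, difficulties[j]))
            remaining.remove(i)
            zs.append(difficulties[i])
            ys.append(ask(i))
            theta = mle_theta(zs, ys)                  # re-estimate ability by MLE
        return theta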

amortized calibration

  • compute the calibration difficulty \(z\) with a learned (amortized) model (see the sketch below)
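
One plausible reading of this (an assumed interpretation, not spelled out in the lecture) is that a regressor is trained once on questions whose difficulties were already calibrated the expensive way, and then predicts \(z\) for new questions from their features. A tiny Python sketch with placeholder names and data:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Placeholder calibration set: feature vectors (e.g., question embeddings)
    # for questions whose difficulties z were estimated with a full calibration run.
    X_calibrated = np.random.randn(500, 64)
    z_calibrated = np.random.randn(500)

    # Fit the amortized calibrator once...
    calibrator = Ridge(alpha=1.0).fit(X_calibrated, z_calibrated)

    # ...then predict difficulties for new, uncalibrated questions cheaply.
    X_new = np.random.randn(10, 64)
    z_new = calibrator.predict(X_new)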

advantages

  • more reliable and efficient across empirical settings
  • incorporates amortized (learned) calibration to reduce calibration costs
  • introduces conditional question generation to produce questions of specific difficulties